Identifying data in a data processing system

ABSTRACT

In a data processing system, a mechanism identifies data items by substantially unique identifiers which depend on all of the data in the data items and only on the data in the data items. Existence means determine whether a particular data item is present in the system, by examining the identifiers of the plurality of data items.

BACKGROUND OF THE INVENTION

[0001] 1. Field of the Invention

[0002] This invention relates to data processing systems and, more particularly, to data processing systems wherein data items are identified by substantially unique identifiers which depend on all of the data in the data items and only on the data in the data items.

[0003] 2. Background of the Invention

[0004] Data processing (DP) systems, computers, networks of computers, or the like, typically offer users and programs various ways to identify the data in the systems.

[0005] Users typically identify data in the data processing system by giving the data some form of name. For example, a typical operating system (OS) on a computer provides a file system in which data items are named by alphanumeric identifiers. Programs typically identify data in the data processing system using a location or address. For example, a program may identify a record in a file or database by using a record number which serves to locate that record.

[0006] In all but the most primitive operating systems, users and programs are able to create and use collections of named data items, these collections themselves being named by identifiers. These named collections can then, themselves, be made part of other named collections. For example, an OS may provide mechanisms to group files (data items) into directories (collections). These directories can then, themselves, be made part of other directories. A data item may thus be identified relative to these nested directories using a sequence of names, or a so-called pathname, which defines a path through the directories to a particular data item (file or directory).

[0007] As another example, a database management system may group data records (data items) into tables and then group these tables into database files (collections). The complete address of any data record can then be specified using the database file name, the table name, and the record number of that data record.

[0008] Other examples of identifying data items include: identifying files in a network file system, identifying objects in an object-oriented database, identifying images in an image database, and identifying articles in a text database.

[0009] In general, the terms “data” and “data item” as used herein refer to sequences of bits. Thus a data item may be the contents of a file, a portion of a file, a page in memory, an object in an object-oriented program, a digital message, a digital scanned image, a part of a video or audio signal, or any other entity which can be represented by a sequence of bits. The term “data processing” herein refers to the processing of data items, and is sometimes dependent on the type of data item being processed. For example, a data processor for a digital image may differ from a data processor for an audio signal.

[0010] In all of the prior data processing systems the names or identifiers provided to identify data items (the data items being files, directories, records in the database, objects in object-oriented programming, locations in memory or on a physical device, or the like) are always defined relative to a specific context. For instance, the file identified by a particular file name can only be determined when the directory containing the file (the context) is known. The file identified by a pathname can be determined only when the file system (context) is known. Similarly, the addresses in a process address space, the keys in a database table, or domain names on a global computer network such as the Internet are meaningful only because they are specified relative to a context.

[0011] In prior art systems for identifying data items there is no direct relationship between the data names and the data item. The same data name in two different contexts may refer to different data items, and two different data names in the same context may refer to the same data item.

[0012] In addition, because there is no correlation between a data name and the data it refers to, there is no a priori way to confirm that a given data item is in fact the one named by a data name. For instance, in a DP system, if one processor requests that another processor deliver a data item with a given data name, the requesting processor cannot, in general, verify that the data delivered is the correct data (given only the name). Therefore it may require further processing, typically on the part of the requester, to verify that the data item it has obtained is, in fact, the item it requested.

[0013] A common operation in a DP system is adding a new data item to the system. When a new data item is added to the system, a name can be assigned to it only by updating the context in which names are defined. Thus such systems require a centralized mechanism for the management of names. Such a mechanism is required even in a multi-processing system when data items are created and identified at separate processors in distinct locations, and in which there is no other need for communication when data items are added.

[0014] In many data processing systems or environments, data items are transferred between different locations in the system. These locations may be processors in the data processing system, storage devices, memory, or the like. For example, one processor may obtain a data item from another processor or from an external storage device, such as a floppy disk, and may incorporate that data item into its system (using the name provided with that data item).

[0015] However, when a processor (or some location) obtains a data item from another location in the DP system, it is possible that this obtained data item is already present in the system (either at the location of the processor or at some other location accessible by the processor) and therefore a duplicate of the data item is created. This situation is common in a network data processing environment where proprietary software products are installed from floppy disks onto several processors sharing a common file server. In these systems, it is often the case that the same product will be installed on several systems, so that several copies of each file will reside on the common file server.

[0016] In some data processing systems in which several processors are connected in a network, one system is designated as a cache server to maintain master copies of data items, and other systems are designated as cache clients to copy local copies of the master data items into a local cache on an as-needed basis. Before using a cached item, a cache client must either reload the cached item, be informed of changes to the cached item, or confirm that the master item corresponding to the cached item has not changed. In other words, a cache client must synchronize its data items with those on the cache server. This synchronization may involve reloading data items onto the cache client. The need to keep the cache synchronized or reload it adds significant overhead to existing caching mechanisms.

[0017] In view of the above and other problems with prior art systems, it is therefore desirable to have a mechanism which allows each processor in a multiprocessor system to determine a common and substantially unique identifier for a data item, using only the data in the data item and not relying on any sort of context.

[0018] It is further desirable to have a mechanism for reducing multiple copies of data items in a data processing system and to have a mechanism which enables the identification of identical data items so as to reduce multiple copies. It is further desirable to determine whether two instances of a data item are in fact the same data item, and to perform various other systems' functions and applications on data items without relying on any context information or properties of the data item.

[0019] It is also desirable to provide such a mechanism in such a way as to make it transparent to users of the data processing system, and it is desirable that a single mechanism be used to address each of the problems described above.

SUMMARY OF THE INVENTION

[0020] This invention provides, in a data processing system, a method and apparatus for identifying a data item in the system, where the identity of the data item depends on all of the data in the data item and only on the data in the data item. Thus the identity of a data item is independent of its name, origin, location, address, or other information not derivable directly from the data, and depends only on the data itself.

[0021] This invention further provides an apparatus and a method for determining whether a particular data item is present in the system or at a location in the system, by examining only the data identities of a plurality of data items.
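
As a minimal sketch of the two ideas above, the following Python fragment derives an identifier from the bytes of a data item alone and tests for presence by comparing identifiers only. The use of SHA-256 and the names `data_identity` and `is_present` are assumptions made for illustration; this summary does not fix a particular identifier function.

```python
import hashlib

def data_identity(data: bytes) -> str:
    """Identifier computed from all of the bytes and nothing else.

    SHA-256 is an assumed stand-in, not a function named by the text.
    """
    return hashlib.sha256(data).hexdigest()

def is_present(data: bytes, known_identities: set) -> bool:
    """Presence test that examines only identities, never names or locations."""
    return data_identity(data) in known_identities

# Identities of the items already held at some (hypothetical) location.
known = {data_identity(b"quarterly report"), data_identity(b"logo image bytes")}
```

Because the identifier is a function of the content only, any two processors that apply the same function to the same bytes arrive at the same identifier without consulting each other.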

[0022] Using the method or apparatus of the present invention, the efficiency and integrity of a data processing system can be improved. The present invention improves the design and operation of a data storage system, file system, relational database, object-oriented database, or the like that stores a plurality of data items, by making possible or improving the design and operation of at least some or all of the following features:

[0023] the system stores at most one copy of any data item at a given location, even when multiple data names in the system refer to the same contents;

[0024] the system avoids copying data from source to destination locations when the destination locations already have the data;

[0025] the system provides transparent access to any data item by reference only to its identity and independent of its present location, whether it be local, remote, or offline;

[0026] the system caches data items from a server, so that only the most recently accessed data items need be retained;

[0027] when the system is being used to cache data items, problems of maintaining cache consistency are avoided;

[0028] the system maintains a desired level of redundancy of data items in a network of servers, to protect against failure by ensuring that multiple copies of the data items are present at different locations in the system;

[0029] the system automatically archives data items as they are created or modified;

[0030] the system provides the size, age, and location of groups of data items in order to decide whether they can be safely removed from a local file system;

[0031] the system can efficiently record and preserve any collection of data items;

[0032] the system can efficiently make a copy of any collection of data items, to support a version control mechanism for groups of the data items;

[0033] the system can publish data items, allowing other, possibly anonymous, systems in a network to gain access to the data items and to rely on the availability of the data items;

[0034] the system can maintain a local inventory of all the data items located on a given removable medium, such as a diskette or CD-ROM; the inventory is independent of other properties of the data items such as their name, location, and date of creation;

[0035] the system allows closely related sets of data items, such as matching or corresponding directories on disconnected computers, to be periodically resynchronized with one another;

[0036] the system can verify that data retrieved from another location is the desired or requested data, using only the data identifier used to retrieve the data;

[0037] the system can prove possession of specific data items by content without disclosing the content of the data items, for purposes of later legal verification and to provide anonymity;

[0038] the system tracks possession of specific data items according to content by owner, independent of the name, date, or other properties of the data item, and tracks the uses of specific data items and files by content for accounting purposes.
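
Two of the features listed above, storing at most one copy per content and verifying retrieved data from its identifier alone, can be sketched as follows. The class `ContentStore`, the functions `true_name` and `verify`, and the choice of SHA-256 are hypothetical and serve only to illustrate the idea.

```python
import hashlib

def true_name(data: bytes) -> str:
    # Assumed stand-in for the identifier function: SHA-256 over the content.
    return hashlib.sha256(data).hexdigest()

class ContentStore:
    """Hypothetical store holding at most one copy per distinct content."""

    def __init__(self):
        self.items = {}   # identifier -> bytes
        self.names = {}   # user-visible name -> identifier

    def put(self, name: str, data: bytes) -> str:
        ident = true_name(data)
        self.items.setdefault(ident, data)  # stored only if not already present
        self.names[name] = ident
        return ident

def verify(ident: str, data: bytes) -> bool:
    """Check delivered data against the identifier used to request it."""
    return true_name(data) == ident

store = ContentStore()
a = store.put("docs/readme.txt", b"hello")
b = store.put("backup/readme-copy.txt", b"hello")  # same content, second name
```

Here two names map to a single stored copy, and a requester holding only the identifier can detect data that does not match it.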

[0039] Other objects, features, and characteristics of the present invention as well as the methods of operation and functions of the related elements of structure, and the combination of parts and economies of manufacture, will become more apparent upon consideration of the following description and the appended claims with reference to the accompanying drawings, all of which form a part of this specification.

BRIEF DESCRIPTION OF THE DRAWINGS

[0040] FIG. 1 depicts a typical data processing system in which a preferred embodiment of the present invention operates;

[0041] FIG. 2 depicts a hierarchy of data items stored at any location in such a data processing system;

[0042] FIGS. 3-9 depict data structures used to implement an embodiment of the present invention; and

[0043] FIGS. 10(a)-28 are flow charts depicting operation of various aspects of the present invention.

DETAILED DESCRIPTION OF THE PRESENTLY PREFERRED EXEMPLARY EMBODIMENTS

[0044] An embodiment of the present invention is now described with reference to a typical data processing system 100, which, with reference to FIG. 1, includes one or more processors (or computers) 102 and various storage devices 104 connected in some way, for example by a bus 106.

[0045] Each processor 102 includes a CPU 108, a memory 110 and one or more local storage devices 112. The CPU 108, memory 110, and local storage device 112 may be internally connected, for example by a bus 114. Each processor 102 may also include other devices (not shown), such as a keyboard, a display, a printer, and the like.

[0046] In a data processing system 100, wherein more than one processor 102 is used, that is, in a multiprocessor system, the processors may be in one of various relationships. For example, two processors 102 may be in a client/server, client/client, or a server/server relationship. These inter-processor relationships may be dynamic, changing depending on particular situations and functions. Thus, a particular processor 102 may change its relationship to other processors as needed, essentially setting up a peer-to-peer relationship with other processors. In a peer-to-peer relationship, sometimes a particular processor 102 acts as a client processor, whereas at other times the same processor acts as a server processor. In other words, there is no hierarchy imposed on or required of processors 102.

[0047] In a multiprocessor system, the processors 102 may be homogeneous or heterogeneous. Further, in a multiprocessor data processing system 100, some or all of the processors 102 may be disconnected from the network of processors for periods of time. Such disconnection may be part of the normal operation of the system 100 or it may be because a particular processor 102 is in need of repair.

[0048] Within a data processing system 100, the data may be organized to form a hierarchy of data storage elements, wherein lower level data storage elements are combined to form higher level elements. This hierarchy can consist of, for example, processors, file systems, regions, directories, data files, segments, and the like. For example, with reference to FIG. 2, the data items on a particular processor 102 may be organized or structured as a file system 116 which comprises regions 117, each of which comprises directories 118, each of which can contain other directories 118 or files 120. Each file 120 is made up of one or more data segments 122.

[0049] In a typical data processing system, some or all of these elements can be named by users given certain implementation specific naming conventions, the name (or pathname) of an element being relative to a context. In the context of a data processing system 100, a pathname is fully specified by a processor name, a file system name, a sequence of zero or more directory names identifying nested directories, and a final file name. (Usually the lowest level elements, in this case segments 122, cannot be named by users.)

[0050] In other words, a file system 116 is a collection of directories 118. A directory 118 is a collection of named files 120, both data files 120 and other directory files 118. A file 120 is a named data item which is either a data file (which may be simple or compound) or a directory file 118. A simple file 120 consists of a single data segment 122. A compound file 120 consists of a sequence of data segments 122. A data segment 122 is a fixed sequence of bytes. An important property of any data segment is its size, the number of bytes in the sequence.
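
The distinction between simple and compound files can be illustrated with a small model. The class names `DataSegment` and `DataFile` are hypothetical, and treating a file's size as the sum of its segment sizes is an assumption of this sketch rather than a statement from the text.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class DataSegment:
    data: bytes  # a fixed sequence of bytes

    @property
    def size(self) -> int:
        # The segment's size is the number of bytes in the sequence.
        return len(self.data)

@dataclass
class DataFile:
    name: str
    segments: List[DataSegment]

    @property
    def is_simple(self) -> bool:
        # A simple file has exactly one segment; a compound file has several.
        return len(self.segments) == 1

    @property
    def size(self) -> int:
        # Assumed: total size is the sum of the segment sizes.
        return sum(segment.size for segment in self.segments)

simple = DataFile("note.txt", [DataSegment(b"abc")])
compound = DataFile("video.bin", [DataSegment(b"abcd"), DataSegment(b"ef")])
```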

[0051] A single processor 102 may access one or more file systems 116, and a single storage device 104 may contain one or more file systems 116, or portions of a file system 116. For instance, a file system 116 may span several storage devices 104.

[0052] In order to implement controls in a file system, file system 116 may be divided into distinct regions, where each region is a unit of management and control. A region consists of a given directory 118 and is identified by the pathname (user defined) of the directory.

[0053] In the following, the term “location”, with respect to a data processing system 100, refers to any of a particular processor 102 in the system, a memory of a particular processor, a storage device, a removable storage medium (such as a floppy disk or compact disk), or any other physical location in the system. The term “local” with respect to a particular processor 102 refers to the memory and storage devices of that particular processor.

[0054] In the following, the terms “True Name”, “data identity” and “data identifier” refer to the substantially unique data identifier for a particular data item. The term “True File” refers to the actual file, segment, or data item identified by a True Name.

[0055] A file system for a data processing system 100 is now described which is intended to work with an existing operating system by augmenting some of the operating system's file management system codes. The embodiment provided relies on the standard file management primitives for actually storing to and retrieving data items from disk, but uses the mechanisms of the present invention to reference and access those data items.

[0056] The processes and mechanisms (services) provided in this embodiment are grouped into the following categories: primitive mechanisms, operating system mechanisms, remote mechanisms, background mechanisms, and extended mechanisms.

[0057] Primitive mechanisms provide fundamental capabilities used to support other mechanisms. The following primitive mechanisms are described:

[0058] 1. Calculate True Name;

[0059] 2. Assimilate Data Item;

[0060] 3. New True File;

[0061] 4. Get True Name from Path;

[0062] 5. Link path to True Name;

[0063] 6. Realize True File from Location;

[0064] 7. Locate Remote File;

[0065] 8. Make True File Local;

[0066] 9. Create Scratch File;

[0067] 10. Freeze Directory;

[0068] 11. Expand Frozen Directory;

[0069] 12. Delete True File;

[0070] 13. Process Audit File Entry;

[0071] 14. Begin Grooming;

[0072] 15. Select For Removal; and

[0073] 16. End Grooming.

[0074] Operating system mechanisms provide typical familiar file system mechanisms, while maintaining the data structures required to offer the mechanisms of the present invention. Operating system mechanisms are designed to augment existing operating systems, and in this way to make the present invention compatible with, and generally transparent to, existing applications. The following operating system mechanisms are described:

[0075] 1. Open File;

[0076] 2. Close File;

[0077] 3. Read File;

[0078] 4. Write File;

[0079] 5. Delete File or Directory;

[0080] 6. Copy File or Directory;

[0081] 7. Move File or Directory;

[0082] 8. Get File Status; and

[0083] 9. Get Files in Directory.

[0084] Remote mechanisms are used by the operating system in responding to requests from other processors. These mechanisms enable the capabilities of the present invention in a peer-to-peer network mode of operation. The following remote mechanisms are described:

[0085] 1. Locate True File;

[0086] 2. Reserve True File;

[0087] 3. Request True File;

[0088] 4. Retire True File;

[0089] 5. Cancel Reservation;

[0090] 6. Acquire True File;

[0091] 7. Lock Cache;

[0092] 8. Update Cache; and

[0093] 9. Check Expiration Date.

[0094] Background mechanisms are intended to run occasionally and at a low priority. These provide automated management capabilities with respect to the present invention. The following background mechanisms are described:

[0095] 1. Mirror True File;

[0096] 2. Groom Region;

[0097] 3. Check for Expired Links;

[0098] 4. Verify Region; and

[0099] 5. Groom Source List.

[0100] Extended mechanisms run within application programs over the operating system. These mechanisms provide solutions to specific problems and applications. The following extended mechanisms are described:

[0101] 1. Inventory Existing Directory;

[0102] 2. Inventory Removable, Read-only Files;

[0103] 3. Synchronize directories;

[0104] 4. Publish Region;

[0105] 5. Retire Directory;

[0106] 6. Realize Directory at location;

[0107] 7. Verify True File;

[0108] 8. Track for accounting purposes; and

[0109] 9. Track for licensing purposes.

[0110] The file system herein described maintains sufficient information to provide a variety of mechanisms not ordinarily offered by an operating system, some of which are listed and described here. Various processing performed by this embodiment of the present invention will now be described in greater detail.

[0111] In some embodiments, some files 120 in a data processing system 100 do not have True Names because they have been recently received or created or modified, and thus their True Names have not yet been computed. A file that does not yet have a True Name is called a scratch file. The process of assigning a True Name to a file is referred to as assimilation, and is described later. Note that a scratch file may have a user provided name.

[0112] Some of the processing performed by the present invention can take place in a background mode or on a delayed or as-needed basis. This background processing is used to determine information that is not immediately required by the system or which may never be required. As an example, in some cases a scratch file is being changed at a rate greater than the rate at which it is useful to determine its True Name. In these cases, determining the True Name of the file can be postponed or performed in the background.

[0113] Data Structures

[0114] The following data structures, stored in memory 110 of one or more processors 102, are used to implement the mechanisms described herein. The data structures can be local to each processor 102 of the system 100, or they can reside on only some of the processors 102.

[0115] The data structures described are assumed to reside on individual peer processors 102 in the data processing system 100. However, they can also be shared by placing them on a remote, shared file server (for instance, in a local area network of machines). In order to accommodate sharing data structures, it is necessary that the processors accessing the shared database use the appropriate locking techniques to ensure that changes to the shared database do not interfere with one another but are appropriately serialized. These locking techniques are well understood by ordinarily skilled programmers of distributed applications.

[0116] It is sometimes desirable to allow some regions to be local to a particular processor 102 and other regions to be shared among processors 102. (Recall that a region is a unit of file system management and control consisting of a given directory identified by the pathname of the directory.) In the case of local and shared regions, there would be both local and shared versions of each data structure. Simple changes to the processes described below must be made to ensure that appropriate data structures are selected for a given operation.

[0117] The local directory extensions (LDE) table 124 is a data structure which provides information about files 120 and directories 118 in the data processing system 100. The local directory extensions table 124 is indexed by a pathname or contextual name (that is, a user provided name) of a file and includes the True Name for most files. The information in local directory extensions table 124 is in addition to that provided by the native file system of the operating system.

[0118] The True File registry (TFR) 126 is a data store for listing actual data items which have True Names, both files 120 and segments 122. When such data items occur in the True File registry 126 they are known as True Files. True Files are identified in True File registry 126 by their True Names or identities. The True File registry 126 also stores location, dependency, and migration information about True Files.

[0119] The region table (RT) 128 defines areas in the network storage which are to be managed separately. Region table 128 defines the rules for access to and migration of files 120 among various regions with the local file system 116 and remote peer file systems.

[0120] The source table (ST) 130 is a list of the sources of True Files other than the current True File registry 126. The source table 130 includes removable volumes and remote processors.

[0121] The audit file (AF) 132 is a list of records indicating changes to be made in local or remote files, these changes to be processed in background.

[0122] The accounting log (AL) 134 is a log of file transactions used to create accounting information in a manner which preserves the identity of files being tracked independent of their name or location.

[0123] The license table (LT) 136 is a table identifying files, which may only be used by licensed users, in a manner independent of their name or location, and the users licensed to use them.

[0124] Detailed Descriptions of the Data Structures

[0125] The following table summarizes the fields of a local directory extensions table entry, as illustrated by record 138 in FIG. 3.

Field: Description
Region ID: identifies the region in which this file is contained.
Pathname: the user provided name or contextual name of the file or directory, relative to the region in which it occurs.
True Name: the computed True Name or identity of the file or directory. This True Name is not always up to date, and it is set to a special value when a file is modified and is later recomputed in the background.
Type: indicates whether the file is a data file or a directory.
Scratch File ID: the physical location of the file in the file system, when no True Name has been calculated for the file. As noted above, such a file is called a scratch file.
Time of last access: the last access time to this file. If this file is a directory, this is the last access time to any file in the directory.
Time of last modification: the time of last change of this file. If this file is a directory, this is the last modification time of any file in the directory.
Safe flag: indicates that this file (and, if this file is a directory, all of its subordinate files) have been backed up on some other system, and it is therefore safe to remove them.
Lock flag: indicates whether a file is locked, that is, it is being modified by the local processor or a remote processor. Only one processor may modify a file at a time.
Size: the full size of this directory (including all subordinate files), if all files in it were fully expanded and duplicated. For a file that is not a directory this is the size of the actual True File.
Owner: the identity of the user who owns this file, for accounting and license tracking purposes.
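
A subset of these fields, together with the way the True Name is reset on modification and recomputed later, might be modeled as in the following sketch. The names `LDEEntry`, `mark_modified`, and `assimilate` are hypothetical; using `None` as the "special value", SHA-256 as the identifier function, and clearing the scratch reference after assimilation are all assumptions of this illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

@dataclass
class LDEEntry:
    region_id: int
    pathname: str
    true_name: Optional[str] = None        # None plays the role of the "special value"
    scratch_file_id: Optional[str] = None  # set only while no True Name is known
    lock_flag: bool = False

def mark_modified(entry: LDEEntry, scratch_file_id: str) -> None:
    """A modified file loses its True Name and is tracked as a scratch file."""
    entry.true_name = None
    entry.scratch_file_id = scratch_file_id

def assimilate(entry: LDEEntry, content: bytes) -> None:
    """Background step: compute the True Name (SHA-256 assumed) for the content."""
    entry.true_name = hashlib.sha256(content).hexdigest()
    entry.scratch_file_id = None  # assumed: no scratch reference once named

entry = LDEEntry(region_id=1, pathname="reports/q1.txt")
mark_modified(entry, scratch_file_id="tmp-0001")
pending = entry.true_name is None  # True until background processing runs
assimilate(entry, b"final text")
```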

[0126] Each record of the True File registry 126 has the fields shown in the True File registry record 140 in FIG. 4. The True File registry 126 consists of the database described in the table below as well as the actual True Files identified by the True File IDs below.

Field: Description
True Name: computed True Name or identity of the file.
Compressed File ID: compressed version of the True File may be stored instead of, or in addition to, an uncompressed version. This field provides the identity of the actual representation of the compressed version of the file.
Grooming delete count: tentative count of how many references have been selected for deletion during a grooming operation.
Time of last access: most recent date and time the content of this file was accessed.
Expiration: date and time after which this file may be deleted by this server.
Dependent processors: processor IDs of other processors which contain references to this True File.
Source IDs: source ID(s) of zero or more sources from which this file or data item may be retrieved.
True File ID: identity or disk location of the actual physical representation of the file or file segment. It is sufficient to use a filename in the registration directory of the underlying operating system. The True File ID is absent if the actual file is not currentlypresent at the current location.
Use count: number of other records on this processor which identify this True File.
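
The interplay of the True Name, True File ID, and use count fields can be sketched with an in-memory registry keyed by True Name. `TrueFileRecord`, `add_reference`, and `is_local` are hypothetical names, and the placeholder True Names and file IDs are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class TrueFileRecord:
    true_name: str
    true_file_id: Optional[str] = None  # None when the file is not present locally
    source_ids: List[int] = field(default_factory=list)
    use_count: int = 0

registry: Dict[str, TrueFileRecord] = {}

def add_reference(true_name: str, true_file_id: Optional[str] = None) -> TrueFileRecord:
    """Register one more local reference to the True File with this True Name."""
    record = registry.setdefault(true_name, TrueFileRecord(true_name, true_file_id))
    record.use_count += 1
    return record

def is_local(true_name: str) -> bool:
    """A True File is held locally only if its record carries a True File ID."""
    record = registry.get(true_name)
    return record is not None and record.true_file_id is not None

add_reference("tn-aaa", true_file_id="registry/0001")
add_reference("tn-aaa")   # second name for the same content
add_reference("tn-bbb")   # known by True Name, content held elsewhere
```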

[0127] A region table 128, specified by a directory pathname, records storage policies which allow files in the file system to be stored, accessed and migrated in different ways. Storage policies are programmed in a configurable way using a set of rules described below.

[0128] Each region table record 142 of region table 128 includes the fields described in the following table (with reference to FIG. 5):

Field: Description
Region ID: internally used identifier for this region.
Region file system: file system on the local processor of which this region is a part.
Region pathname: a pathname relative to the region file system which defines the location of this region. The region consists of all files and directories subordinate to this pathname, except those in a region subordinate to this region.
Mirror processor(s): zero or more identifiers of processors which are to keep mirror or archival copies of all files in the current region. Multiple mirror processors can be defined to form a mirror group.
Mirror duplication count: number of copies of each file in this region that should be retained in a mirror group.
Region status: specifies whether this region is local to a single processor 102, shared by several processors 102 (if, for instance, it resides on a shared file server), or managed by a remote processor.
Policy: the migration policy to apply to this region. A single region might participate in several policies. The policies are as follows (parameters in brackets are specified as part of the policy):
  region is a cached version from [processor ID];
  region is a member of a mirror set defined by [processor ID];
  region is to be archived on [processor ID];
  region is to be backed up locally, by placing new copies in [region ID];
  region is read only and may not be changed;
  region is published and expires on [date];
  files in this region should be compressed.
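
The rule that a region covers everything beneath its pathname except what belongs to a subordinate region amounts to choosing the most deeply nested matching region. The following sketch shows one way to express that rule; `RegionRecord`, `region_for`, and the sample paths are hypothetical, and POSIX-style paths are assumed.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import List, Optional

@dataclass
class RegionRecord:
    region_id: int
    region_pathname: str

def region_for(path: str, regions: List[RegionRecord]) -> Optional[RegionRecord]:
    """Return the most deeply nested region whose pathname contains `path`."""
    target = PurePosixPath(path)
    best: Optional[RegionRecord] = None
    best_depth = -1
    for region in regions:
        root = PurePosixPath(region.region_pathname)
        contains = target == root or root in target.parents
        if contains and len(root.parts) > best_depth:
            best, best_depth = region, len(root.parts)
    return best

regions = [RegionRecord(1, "/home"), RegionRecord(2, "/home/shared")]
```

With these two regions, a file under `/home/shared` resolves to region 2 rather than region 1, and a path outside both resolves to no region.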

[0129] A source table 130 identifies a source location for True Files. The source table 130 is also used to identify client processors making reservations on the current processor. Each source record 144 of the source table 130 includes the fields summarized in the following table, with reference to FIG. 6:

Field: Description
source ID: internal identifier used to identify a particular source.
source type: type of source location: Removable Storage Volume; Local Region; Cache Server; Mirror Group Server; Cooperative Server; Publishing Server; Client.
source rights: includes information about the rights of this processor, such as whether it can ask the local processor to store data items for it.
source availability: measurement of the bandwidth, cost, and reliability of the connection to this source of True Files. The availability is used to select from among several possible sources.
source location: information on how the local processor is to access the source. This may be, for example, the name of a removable storage volume, or the processor ID and region path of a region on a remote processor.
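
Selecting among several possible sources by availability could look like the sketch below. Collapsing bandwidth, cost, and reliability into one numeric score where higher is better is an assumption of this example, as are the names `SourceRecord` and `pick_source` and the sample table contents.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SourceRecord:
    source_id: int
    source_type: str
    availability: float  # assumed: a single score, higher is better
    location: str

def pick_source(candidate_ids: List[int], table: List[SourceRecord]) -> Optional[SourceRecord]:
    """Choose the candidate source with the best availability score."""
    candidates = [s for s in table if s.source_id in candidate_ids]
    return max(candidates, key=lambda s: s.availability, default=None)

table = [
    SourceRecord(1, "Removable Storage Volume", 0.2, "volume:ARCHIVE_07"),
    SourceRecord(2, "Cache Server", 0.9, "processor-12:/cache"),
    SourceRecord(3, "Mirror Group Server", 0.6, "processor-30:/mirror"),
]
```

The candidate IDs would typically come from the Source IDs field of a True File registry record.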

[0130] The audit file 132 is a table of events ordered by timestamp, each record 146 in audit file 132 including the fields summarized in the following table (with reference to FIG. 7):

Field: Description
Original Name: path of the file in question.
Operation: whether the file was created, read, written, copied or deleted.
Type: specifies whether the source is a file or a directory.
Processor ID: ID of the remote processor generating this event (if not local).
Timestamp: time and date file was closed (required only for accessed/modified files).
Pathname: name of the file (required only for rename).
True Name: computed True Name of the file. This is used by remote systems to mirror changes to the directory and is filled in during background processing.

[0131] Each record 148 of the accounting log 134 records an event which may later be used to provide information for billing mechanisms. Each accounting log entry record 148 includes at least the information summarized in the following table, with reference to FIG. 8:

Field: Description
date of entry: date and time of this log entry.
type of entry: entry types include create file, delete file, and transmit file.
True Name: True Name of data item in question.
owner: identity of the user responsible for this action.

[0132] Each record 150 of the license table 136 records a relationship between a licensable data item and the user licensed to have access to it. Each license table record 150 includes the information summarized in the following table, with reference to FIG. 9:

Field: Description
True Name: True Name of a data item subject to license validation.
licensee: identity of a user authorized to have access to this object.

[0133] Various other data structures are employed on some or all of the processors 102 in the data processing system 100. Each processor 102 has a global freeze lock (GFL) 152 (FIG. 1), which is used to prevent synchronization errors when a directory is frozen or copied. Any processor 102 may include a special archive directory (SAD) 154 into which directories may be copied for the purposes of archival. Any processor 102 may include a special media directory (SMD) 156, into which the directories of removable volumes are stored to form a media inventory. Each processor has a grooming lock 158, which is set during a grooming operation. During this period the grooming delete count of True File registry entries 140 is active, and no True Files should be deleted until grooming is complete. While grooming is in effect, grooming information includes a table of pathnames selected for deletion, and keeps track of the amount of space that would be freed if all of the files were deleted.

[0134] Primitive Mechanisms

[0135] The first of the mechanisms provided by the present invention, primitive mechanisms, are now described. The mechanisms described here depend on underlying data management mechanisms to create, copy, read, and delete data items in the True File registry 126, as identified by a True File ID. This support may be provided by an underlying operating system or disk storage manager.

[0136] The following primitive mechanisms are described:

[0137] 1. Calculate True Name;

[0138] 2. Assimilate Data Item;

[0139] 3. New True File;

[0140] 4. Get True Name from Path;

[0141] 5. Link Path to True Name;

[0142] 6. Realize True File from Location;

[0143] 7. Locate Remote File;

[0144] 8. Make True File Local;

[0145] 9. Create Scratch File;

[0146] 10. Freeze Directory;

[0147] 11. Expand Frozen Directory;

[0148] 12. Delete True File;

[0149] 13. Process Audit File Entry;

[0150] 14. Begin Grooming;

[0151] 15. Select For Removal; and

[0152] 16. End Grooming.

[0153] 1. Calculate True Name

[0154] A True Name is computed using a function, MD, which reduces a data block B of arbitrary length to a relatively small, fixed size identifier, the True Name of the data block, such that the True Name of the data block is virtually guaranteed to represent the data block B and only data block B.

[0155] The function MD must have the following properties:

[0156] 1. The domain of the function MD is the set of all data items. The range of the function MD is the set of True Names.

[0157] 2. The function MD must take a data item of arbitrary length and reduce it to an integer value in the range 0 to N−1, where N is the cardinality of the set of True Names. That is, for an arbitrary length data block B, 0≦MD(B)<N.

[0158] 3. The results of MD(B) must be evenly and randomly distributed over the range of N, in such a way that simple or regular changes to B are virtually guaranteed to produce a different value of MD(B).

[0159] 4. It must be computationally difficult to find a different value B′ such that MD(B)=MD(B′).

[0160] 5. The function MD(B) must be efficiently computed.

[0161] One family of functions with the above properties is that of the so-called message digest functions, which are used in digital security systems as techniques for authentication of data. These functions (or algorithms) include MD4, MD5, and SHA.

[0162] In the presently preferred embodiments, either MD5 or SHA is employed as the basis for the computation of True Names. Whichever of these two message digest functions is employed, that same function must be employed on a system-wide basis.

[0163] It is impossible to define a function having a unique output for each possible input when the number of elements in the range of the function is smaller than the number of elements in its domain. However, a crucial observation is that the actual data items that will be encountered in the operation of any system embodying this invention form a very sparse subset of all the possible inputs.

[0164] A colliding set of data items is defined as a set wherein, for one or more pairs x and y in the set, MD(x)=MD(y). Since a function conforming to the requirements for MD must evenly and randomly distribute its outputs, it is possible, by making the range of the function large enough, to make the probability arbitrarily small that actual inputs encountered in the operation of an embodiment of this invention will form a colliding set.

[0165] To roughly quantify the probability of a collision, assume that there are no more than 2³⁰ storage devices in the world, and that each storage device has an average of at most 2²⁰ different data items. Then there are at most 2⁵⁰ data items in the world. If the outputs of MD range between 0 and 2¹²⁸−1, it can be demonstrated that the probability of a collision is approximately 1 in 2²⁹. Details on the derivation of these probability values are found, for example, in P. Flajolet and A. M. Odlyzko, “Random Mapping Statistics,” Lecture Notes in Computer Science 434: Advances in Cryptology - Eurocrypt '89 Proceedings, Springer-Verlag, pp. 329-354.
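The figure of roughly 1 in 2²⁹ can be checked with the standard birthday approximation, in which n items spread uniformly over N names collide with probability of about n(n−1)/(2N). The following Python sketch is illustrative only; the function name is hypothetical and the approximation is an assumption about how the stated figure is obtained, not the derivation given in the cited reference.

```python
from fractions import Fraction
from math import log2


def collision_probability_bound(n_items: int, n_names: int) -> Fraction:
    """Birthday approximation n(n-1)/(2N) for n items mapped uniformly onto N names."""
    return Fraction(n_items * (n_items - 1), 2 * n_names)


# 2**50 data items in the world, True Names drawn from a range of 2**128 values.
p = collision_probability_bound(2**50, 2**128)
print(round(log2(float(p))))  # -29, i.e. about 1 in 2**29
```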

[0166] Note that for some less preferred embodiments of the present invention, lower probabilities of uniqueness may be acceptable, depending on the types of applications and mechanisms used. In some embodiments it may also be useful to have more than one level of True Names, with some of the True Names having different degrees of uniqueness. If such a scheme is implemented, it is necessary to ensure that less unique True Names are not propagated in the system.

[0167] While the invention is described herein using only the True Name of a data item as the identifier for the data item, other preferred embodiments use tagged, typed, categorized or classified data items and use a combination of both the True Name and the tag, type, category or class of the data item as an identifier. Examples of such categorizations are files, directories, and segments; executable files and data files, and the like. Examples of classes are classes of objects in an object-oriented system. In such a system, a lower degree of True Name uniqueness is acceptable over the entire universe of data items, as long as sufficient uniqueness is provided per category of data items. This is because the tags provide an additional level of uniqueness.

[0168] A mechanism for calculating a True Name given a data item is now described, with reference to FIGS. 10(a) and 10(b).

[0169] A simple data item is a data item whose size is less than a particular given size (which must be defined in each particular implementation of the invention). To determine the True Name of a simple data item, with reference to FIG. 10(a), first compute the MD function (described above) on the given simple data item (Step S212). Then append to the resulting 128 bits, the byte length modulo 32 of the data item (Step S214). The resulting 160-bit value is the True Name of the simple data item.

[0170] A compound data item is one whose size is greater than the particular given size of a simple data item. To determine the True Name of an arbitrary (simple or compound) data item, with reference to FIG. 10(b), first determine if the data item is a simple or a compound data item (Step S216). If the data item is a simple data item, then compute its True Name in step S218 (using steps S212 and S214 described above), otherwise partition the data item into segments (Step S220) and assimilate each segment (Step S222) (the primitive mechanism, Assimilate a Data Item, is described below), computing the True Name of the segment. Then create an indirect block consisting of the computed segment True Names (Step S224). An indirect block is a data item which consists of the sequence of True Names of the segments. Then, in step S226, assimilate the indirect block and compute its True Name. Finally, replace the final thirty-two (32) bits of the resulting True Name (that is, the length of the indirect block) by the length modulo 32 of the compound data item (Step S228). The result is the True Name of the compound data item.

[0171] Note that the compound data item may be so large that the indirect block of segment True Names is itself a compound data item. In this case the mechanism is invoked recursively until only simple data items are being processed.

[0172] Both the use of segments and the attachment of a length to the True Name are not strictly required in a system using the present invention, but are currently considered desirable features in the preferred embodiment.
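The calculation of steps S212 to S228 might be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the preferred embodiment: MD5 stands in for MD, the size threshold and segment size are hypothetical values, an item exactly at the threshold is treated as compound, and the 32-bit length field is read as the byte length reduced modulo 2³² (one reading of "length modulo 32" that is consistent with a 160-bit result). Assimilation of segments into the registry is omitted.

```python
import hashlib

SIMPLE_LIMIT = 1 << 20   # hypothetical threshold; the text leaves it to each implementation
SEGMENT_SIZE = 1 << 16   # hypothetical segment size


def _length_field(n: int) -> bytes:
    # Assumption: the appended length is a 32-bit field (length modulo 2**32).
    return (n % 2**32).to_bytes(4, "big")


def simple_true_name(data: bytes) -> bytes:
    # Steps S212 and S214: 128-bit digest followed by the 32-bit length field.
    return hashlib.md5(data).digest() + _length_field(len(data))


def true_name(data: bytes, limit: int = SIMPLE_LIMIT, segment: int = SEGMENT_SIZE) -> bytes:
    # The recursion terminates only if each indirect block is shorter than its input.
    assert 20 < segment < limit
    if len(data) < limit:                                   # Steps S216, S218
        return simple_true_name(data)
    segments = [data[i:i + segment] for i in range(0, len(data), segment)]  # Step S220
    indirect = b"".join(true_name(s, limit, segment) for s in segments)     # S222, S224
    name = true_name(indirect, limit, segment)              # Step S226 (may recurse)
    return name[:-4] + _length_field(len(data))             # Step S228
```

With these assumptions, every True Name is 20 bytes long, and two data items with identical contents always receive the same True Name regardless of where they are stored.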

[0173] 2. Assimilate Data Item

[0174] A mechanism for assimilating a data item (scratch file or segment) into a file system, given the scratch file ID of the data item, is now described with reference to FIG. 11. The purpose of this mechanism is to add a given data item to the True File registry 126. If the data item already exists in the True File registry 126, this will be discovered and used during this process, and the duplicate will be eliminated.

[0175] Thereby the system stores at most one copy of any data item or file by content, even when multiple names refer to the same content.

[0176] First, determine the True Name of the data item corresponding to the given scratch File ID using the Calculate True Name primitive mechanism (Step S230). Next, look for an entry for the True Name in the True File registry 126 (Step S232) and determine whether a True Name entry, record 140, exists in the True File registry 126. If such an entry exists and the entry record includes a corresponding True File ID or compressed File ID (Step S237), delete the file with the scratch File ID (Step S238). Otherwise store the given True File ID in the entry record (Step S239).

[0177] If it is determined (in step S232) that no True Name entry exists in the True File registry 126, then, in Step S236, create a new entry in the True File registry 126 for this True Name. Set the True Name of the entry to the calculated True Name, set the use count for the new entry to one, store the given True File ID in the entry and set the other fields of the entry as appropriate.

[0178] Because this procedure may take some time to compute, it is intended to run in background after a file has ceased to change. In the meantime, the file is considered an unassimilated scratch file.
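The assimilation steps S230 to S239 can be illustrated with a small in-memory model. In this Python sketch the registry is a dictionary keyed by True Name, scratch files are held in a second dictionary keyed by file ID, and `RegistryEntry` is a simplified, hypothetical stand-in for record 140; the True Name is computed as for a simple data item, with MD5 assumed for MD and a 32-bit length field assumed.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class RegistryEntry:
    """Simplified, hypothetical stand-in for True File registry entry record 140."""
    true_name: bytes
    true_file_id: Optional[str] = None
    compressed_file_id: Optional[str] = None
    use_count: int = 0


def assimilate(registry: Dict[bytes, RegistryEntry],
               scratch_files: Dict[str, bytes], scratch_id: str) -> bytes:
    data = scratch_files[scratch_id]
    # Step S230 (simple data items only; 32-bit length field is an assumption).
    name = hashlib.md5(data).digest() + (len(data) % 2**32).to_bytes(4, "big")
    entry = registry.get(name)                                   # Step S232
    if entry is None:                                            # Step S236
        registry[name] = RegistryEntry(name, true_file_id=scratch_id, use_count=1)
    elif entry.true_file_id or entry.compressed_file_id:         # Step S237
        del scratch_files[scratch_id]                            # Step S238: drop duplicate
    else:
        entry.true_file_id = scratch_id                          # Step S239
    return name
```

Assimilating two scratch files with the same contents leaves a single registry entry and a single stored file, which is the duplicate elimination described above.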

[0179] 3. New True File

[0180] The New True File process is invoked when processing the audit file 132, some time after a True File has been assimilated (using the Assimilate Data Item primitive mechanism). Given a local directory extensions table entry record 138 in the local directory extensions table 124, the New True File process can provide the following steps (with reference to FIG. 12), depending on how the local processor is configured:

[0181] First, in step S238, examine the local directory extensions table entry record 138 to determine whether the file is locked by a cache server. If the file is locked, then add the ID of the cache server to the dependent processor list of the True File registry table 126, and then send a message to the cache server to update the cache of the current processor using the Update Cache remote mechanism (Step S242).

[0182] If desired, compress the True File (Step S246), and, if desired, mirror the True File using the Mirror True File background mechanism (Step S248).

[0183] 4. Get True Name from Path

[0184] The True Name of a file can be used to identify a file by contents, to confirm that a file matches its original contents, or to compare two files. The mechanism to get a True Name given the pathname of a file is now described with reference to FIG. 13.

[0185] First, search the local directory extensions table 124 for the entry record 138 with the given pathname (Step S250). If the pathname is not found, this process fails and no True Name corresponding to the given pathname exists. Next, determine whether the local directory extensions table entry record 138 includes a True Name (Step S252), and if so, the mechanism's task is complete. Otherwise, determine whether the local directory extensions table entry record 138 identifies a directory (Step S254), and if so, freeze the directory (Step S256) (the primitive mechanism Freeze Directory is described below).

[0186] Otherwise, in step S258, assimilate the file (using the Assimilate Data Item primitive mechanism) defined by the File ID field to generate its True Name and store its True Name in the local directory extensions entry record. Then return the True Name identified by the local directory extensions table 124.

[0187] 5. Link Path to True Name

[0188] The mechanism to link a path to a True Name provides a way of creating a new directory entry record identifying an existing, assimilated file. This basic process may be used to copy, move, and rename files without a need to copy their contents. The mechanism to link a path to a True Name is now described with reference to FIG. 14.

[0189] First, if desired, confirm that the True Name exists locally by searching for it in the True File registry 126 or local directory extensions table 124 (Step S260). Most uses of this mechanism will require this form of validation. Next, search for the path in the local directory extensions table 124 (Step S262). Confirm that the directory containing the file named in the path already exists (Step S264). If the named file itself exists, delete the File using the Delete True File operating system mechanism (see below) (Step S268).

[0190] Then, create an entry record in the local directory extensions with the specified path (Step S270) and update the entry record and other data structures as follows: fill in the True Name field of the entry with the specified True Name; increment the use count for the True File registry entry record 140 of the corresponding True Name; note whether the entry is a directory by reading the True File to see if it contains a tag (magic number) indicating that it represents a frozen directory (see also the description of the Freeze Directory primitive mechanism regarding the tag); and compute and set the other fields of the local directory extensions appropriately. For instance, search the region table 128 to identify the region of the path, and set the time of last access and time of last modification to the current time.
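A minimal Python sketch of steps S260 to S270 follows. The three containers are hypothetical simplifications: `entries` maps a path to a True Name (standing in for the local directory extensions table 124), `directories` is the set of existing directory paths, and `use_counts` maps a True Name to its use count (standing in for registry records 140). Deleting an existing file is reduced to decrementing its use count, and the region, tag and timestamp fields are not modeled.

```python
import posixpath
from typing import Dict, Set


def link_path_to_true_name(entries: Dict[str, bytes], directories: Set[str],
                           use_counts: Dict[bytes, int],
                           path: str, true_name: bytes) -> None:
    if true_name not in use_counts:                        # Step S260 (optional validation)
        raise KeyError("True Name is not known locally")
    if posixpath.dirname(path) not in directories:         # Step S264
        raise FileNotFoundError("containing directory does not exist")
    old_name = entries.get(path)                           # Step S262
    if old_name is not None:                               # Step S268, simplified
        use_counts[old_name] -= 1
    entries[path] = true_name                              # Step S270: new entry record
    use_counts[true_name] += 1                             # one more reference to the True File
```

Copying a file under this model only adds a second path for the same True Name and raises the use count to two; no file contents are read or written.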

[0191] 6. Realize True File from Location

[0192] This mechanism is used to try to make a local copy of a True File, given its True Name and the name of a source location (processor or media) that may contain the True File. This mechanism is now described with reference to FIG. 15.

[0193] First, in step S272, determine whether the location specified is a processor. If it is determined that the location specified is a processor, then send a Request True File message (using the Request True File remote mechanism) to the remote processor and wait for a response (Step S274). If a negative response is received or no response is received after a timeout period, this mechanism fails. If a positive response is received, enter the True File returned in the True File registry 126 (Step S276). (If the file received was compressed, enter the True File ID in the compressed File ID field.)

[0194] If, on the other hand, it is determined in step S272 that the location specified is not a processor, then, if necessary, request the user or operator to mount the indicated volume (Step S278). Then (Step S280) find the indicated file on the given volume and assimilate the file using the Assimilate Data Item primitive mechanism. If the volume does not contain a True File registry 126, search the media inventory to find the path of the file on the volume. If no such file can be found, this mechanism fails.

[0195] At this point, whether or not the location is determined (in step S272) to be a processor, if desired, verify the True File (in step S282).

[0196] 7. Locate Remote File

[0197] This mechanism allows a processor to locate a file or data item from a remote source of True Files, when a specific source is unknown or unavailable. A client processor system may ask one of several or many sources whether it can supply a data object with a given True Name. The steps to perform this mechanism are as follows (with reference to FIG. 16).

[0198] The client processor 102 uses the source table 130 to select one or more source processors (Step S284). If no source processor can be found, the mechanism fails. Next, the client processor 102 broadcasts to the selected sources a request to locate the file with the given True Name using the Locate True File remote mechanism (Step S286). The request to locate may be augmented by asking to propagate this request to distant servers. The client processor then waits for one or more servers to respond positively (Step S288). After all servers respond negatively, or after a timeout period with no positive response, the mechanism repeats selection (Step S284) to attempt to identify alternative sources. If any selected source processor responds positively, its processor ID is the result of this mechanism. Store the processor ID in the source field of the True File registry entry record 140 of the given True Name (Step S290).

[0199] If the source location of the True Name is a different processor or medium than the destination (Step S290a), perform the following steps:

[0200] (i) Look up the True File registry entry record 140 for the corresponding True Name, and add the source location ID to the list of sources for the True Name (Step S290b); and

[0201] (ii) If the source is a publishing system, determine the expiration date on the publishing system for the True Name and add that to the list of sources. If the source is not a publishing system, send a message to reserve the True File on the source processor (Step S290c).

[0202] Source selection in step S284 may be based on optimizations involving general availability of the source, access time, bandwidth, and transmission cost, and ignoring previously selected processors which did not respond in step S288.

[0203] 8. Make True File Local

[0204] This mechanism is used when a True Name is known and a locally accessible copy of the corresponding file or data item is required. This mechanism makes it possible to actually read the data in a True File. The mechanism takes a True Name and returns when there is a local, accessible copy of the True File in the True File registry 126. This mechanism is described here with reference to the flow chart of FIG. 17.

[0205] First, look in the True File registry 126 for a True File entry record 140 for the corresponding True Name (Step S292). If no such entry is found this mechanism fails. If there is already a True File ID for the entry (Step S294), this mechanism's task is complete. If there is a compressed file ID for the entry (Step S296), decompress the file corresponding to the file ID (Step S298) and store the decompressed file ID in the entry (Step S300). This mechanism is then complete.

[0206] If there is no True File ID for the entry (Step S294) and there is no compressed file ID for the entry (Step S296), then continue searching for the requested file. At this time it may be necessary to notify the user that the system is searching for the requested file.

[0207] If there are one or more source IDs, then select an order in which to attempt to realize the source ID (Step S304). The order may be based on optimizations involving general availability of the source, access time, bandwidth, and transmission cost. For each source in the order chosen, realize the True File from the source location (using the Realize True File from Location primitive mechanism), until the True File is realized (Step S306). If it is realized, continue with step S294. If no known source can realize the True File, use the Locate Remote File primitive mechanism to attempt to find the True File (Step S308). If this succeeds, realize the True File from the identified source location and continue with step S296.
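The fallback order of steps S294 to S308 can be sketched in Python as shown below. The registry entry is modeled as a plain dictionary, and `decompress`, `realize` and `locate` are hypothetical callables standing in for the decompression routine and for the Realize True File from Location and Locate Remote File primitive mechanisms; `realize` is assumed to record a True File ID or compressed file ID in the entry when it succeeds. Source ordering (Step S304) is taken to be the order of the `sources` list.

```python
def make_true_file_local(entry, decompress, realize, locate):
    """Return a local True File ID for the entry, or raise LookupError."""

    def local_id():
        if entry.get("true_file_id"):                          # Step S294
            return entry["true_file_id"]
        if entry.get("compressed_file_id"):                    # Step S296
            entry["true_file_id"] = decompress(entry["compressed_file_id"])  # S298, S300
            return entry["true_file_id"]
        return None

    found = local_id()
    if found:
        return found
    for source in entry.get("sources", []):                    # Steps S304, S306
        if realize(source, entry):
            found = local_id()
            if found:
                return found
    source = locate(entry)                                      # Step S308
    if source is not None and realize(source, entry):
        found = local_id()
        if found:
            return found
    raise LookupError("True File could not be made local")
```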

[0208] 9. Create Scratch File

[0209] A scratch copy of a file is required when a file is being created or is about to be modified. The scratch copy is stored in the file system of the underlying operating system. The scratch copy is eventually assimilated when the audit file record entry 146 is processed by the Process Audit File Entry primitive mechanism. This Create Scratch File mechanism requires a local directory extensions table entry record 138. When it succeeds, the local directory extensions table entry record 138 contains the scratch file ID of a scratch file that is not contained in the True File registry 126 and that may be modified. This mechanism is now described with reference to FIG. 18.

[0210] First determine whether the scratch file should be a copy of the existing True File (Step S310). If so, continue with step S312. Otherwise, determine whether the local directory extensions table entry record 138 identifies an existing True File (Step S316), and if so, delete the True File using the Delete True File primitive mechanism (Step S318). Then create a new, empty scratch file and store its scratch file ID in the local directory extensions table entry record 138 (Step S320). This mechanism is then complete.

[0211] If the local directory extensions table entry record 138 identifies a scratch file ID (Step S312), then the entry already has a scratch file, so this mechanism succeeds.

[0212] If the local directory extensions table entry record 138 identifies a True File (S316), and there is no True File ID for the True File (S312), then make the True File local using the Make True File Local primitive mechanism (Step S322). If there is still no True File ID, this mechanism fails.

[0213] There is now a local True File for this file. If the use count in the corresponding True File registry entry record 140 is one (Step S326), save the True File ID in the scratch file ID of the local directory extensions table entry record 138, and remove the True File registry entry record 140 (Step S328). (This step makes the True File into a scratch file.) This mechanism's task is complete.

[0214] Otherwise, if the use count in the corresponding True File registry entry record 140 is not one (in step S326), copy the file with the given True File ID to a new scratch file, using the Read File OS mechanism and store its file ID in the local directory extensions table entry record 138 (Step S330), and reduce the use count for the True File by one. If there is insufficient space to make a copy, this mechanism fails.
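The copy-on-write decision of steps S326 to S330 is sketched below in Python, assuming the True File is already local. The dictionaries are hypothetical simplifications: `dir_entry` stands in for entry record 138, `registry` maps a True Name to a record with `true_file_id` and `use_count` fields, and `files` maps a file ID to its contents. The insufficient-space failure is not modeled, and `new_id` is a caller-supplied ID for the copy.

```python
def scratch_from_true_file(dir_entry, registry, files, new_id):
    """Give dir_entry a modifiable scratch file based on its current True File."""
    name = dir_entry["true_name"]
    record = registry[name]
    if record["use_count"] == 1:                       # Step S326
        # Step S328: sole reference, so the True File itself becomes the scratch file.
        dir_entry["scratch_file_id"] = record["true_file_id"]
        del registry[name]
    else:
        # Step S330: other paths still refer to the True File, so copy it.
        files[new_id] = bytes(files[record["true_file_id"]])
        dir_entry["scratch_file_id"] = new_id
        record["use_count"] -= 1
    return dir_entry["scratch_file_id"]
```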

[0215] 10. Freeze Directory

[0216] This mechanism freezes a directory in order to calculate its True Name. Since the True Name of a directory is a function of the files within the directory, they must not change during the computation of the True Name of the directory. This mechanism requires the pathname of a directory to freeze. This mechanism is described with reference to FIG. 19.

[0217] In step S332, add one to the global freeze lock. Then search the local directory extensions table 124 to find each subordinate data file and directory of the given directory, and freeze each subordinate directory found using the Freeze Directory primitive mechanism (Step S334). Assimilate each unassimilated data file in the directory using the Assimilate Data Item primitive mechanism (Step S336). Then create a data item which begins with a tag or marker (a “magic number”) being a unique data item indicating that this data item is a frozen directory (Step S337). Then list the file name and True Name for each file in the current directory (Step S338). Record any additional information required, such as the type, time of last access and modification, and size (Step S340). Next, in step S342, using the Assimilate Data Item primitive mechanism, assimilate the data item created in step S337. The resulting True Name is the True Name of the frozen directory. Finally, subtract one from the global freeze lock (Step S344).
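As an illustration of steps S332 to S344, the following Python sketch computes the True Name of a directory held in memory as a nested dictionary (file name mapped to file contents, or to another dictionary for a subdirectory). Several details are assumptions made for the sketch: the tag value, the listing format, the sorting of entries by name, MD5 with a 32-bit length field for the True Name, and the omission of the additional per-file information of step S340. A module-level counter stands in for the global freeze lock 152.

```python
import hashlib

MAGIC = b"FROZEN-DIRECTORY\n"   # hypothetical tag value ("magic number")
freeze_lock = 0                 # stand-in for the global freeze lock 152


def _true_name(data: bytes) -> bytes:
    # Simple-item True Name; MD5 and a 32-bit length field are assumptions.
    return hashlib.md5(data).digest() + (len(data) % 2**32).to_bytes(4, "big")


def freeze_directory(tree: dict) -> bytes:
    """Return the True Name of a directory given as {name: bytes or nested dict}."""
    global freeze_lock
    freeze_lock += 1                                       # Step S332
    try:
        parts = [MAGIC]                                    # Step S337
        for name in sorted(tree):                          # ordering is an assumption
            child = tree[name]
            if isinstance(child, dict):
                child_name = freeze_directory(child)       # Step S334
            else:
                child_name = _true_name(child)             # Step S336
            # Step S338: list the file name and True Name of each entry.
            parts.append(name.encode() + b"\0" + child_name.hex().encode() + b"\n")
        return _true_name(b"".join(parts))                 # Step S342
    finally:
        freeze_lock -= 1                                   # Step S344
```

Because the directory's True Name depends on the True Names of everything beneath it, changing any nested file changes the True Name of every enclosing frozen directory.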

[0218] 11. Expand Frozen Directory

[0219] This mechanism expands a frozen directory in a given location. It requires a given pathname into which to expand the directory, and the True Name of the directory and is described with reference to FIG. 20.

[0220] First, in step S346, make the True File with the given True Name local using the Make True File Local primitive mechanism. Then read each directory entry in the local file created in step S346 (Step S348). For each such directory entry, do the following:

[0221] Create a full pathname using the given pathname and the file name of the entry (Step S350); and

[0222] link the created path to the True Name (Step S352) using the Link Path to True Name primitive mechanism.

[0223] 12. Delete True File

[0224] This mechanism deletes a reference to a True Name. The underlying True File is not removed from the True File registry 126 unless there are no additional references to the file. With reference to FIG. 21, this mechanism is performed as follows:

[0225] If the global freeze lock is on, wait until the global freeze lock is turned off (Step S354). This prevents deleting a True File while a directory which might refer to it is being frozen. Next, find the True File registry entry record 140 given the True Name (Step S356). If the reference count field of the True File registry entry record 140 is greater than zero, subtract one from the reference count field (Step S358). If it is determined (in step S360) that the reference count field of the True File registry entry record 140 is zero, and if there are no dependent systems listed in the True File registry entry record 140, then perform the following steps:

[0226] (i) If the True File is a simple data item, then delete the True File, otherwise,

[0227] (ii) (the True File is a compound data item) for each True Name in the data item, recursively delete the True File corresponding to the True Name (Step S362).

[0228] (iii) Remove the file indicated by the True File ID and compressed file ID from the True File registry 126, and remove the True File registry entry record 140 (Step S364).
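The reference-counting behavior of steps S356 to S364 might be sketched in Python as follows. The registry is a dictionary from True Name to a hypothetical record with `use_count`, `dependents`, `true_file_id` and, for compound items, a `segments` list standing in for the True Names read from the indirect block. Waiting on the global freeze lock (Step S354) and compressed file IDs are omitted.

```python
def delete_true_file(registry, files, true_name):
    """Drop one reference to true_name; remove the True File when none remain."""
    record = registry[true_name]                            # Step S356
    if record["use_count"] > 0:
        record["use_count"] -= 1                            # Step S358
    if record["use_count"] == 0 and not record["dependents"]:    # Step S360
        for segment_name in record.get("segments", []):     # Step S362 (compound items)
            delete_true_file(registry, files, segment_name)
        files.pop(record["true_file_id"], None)             # Step S364
        del registry[true_name]
```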

[0229] 13. Process Audit File Entry

[0230] This mechanism performs tasks which are required to maintain information in the local directory extensions table 124 and True File registry 126, but which can be delayed while the processor is busy doing more time-critical tasks. Entries 146 in the audit file 132 should be processed at a background priority as long as there are entries to be processed. With reference to FIG. 22, the steps for processing an entry are as follows:

[0231] Determine the operation in the entry 146 currently being processed (Step S365). If the operation indicates that a file was created or written (Step S366), then assimilate the file using the Assimilate Data Item primitive mechanism (Step S368), use the New True File primitive mechanism to do additional desired processing (such as cache update, compression, and mirroring) (Step S369), and record the newly computed True Name for the file in the audit file record entry (Step S370).

[0232] Otherwise, if the entry being processed indicates that a compound data item or directory was copied (or deleted) (Step S376), then for each component True Name in the compound data item or directory, add (or subtract) one to the use count of the True File registry entry record 140 corresponding to the component True Name (Step S378).

[0233] In all cases, for each parent directory of the given file, update the size, time of last access, and time of last modification, according to the operation in the audit record (Step S379).

[0234] Note that the audit record is not removed after processing, but is retained for some reasonable period so that it may be used by the Synchronize Directory extended mechanism to allow a disconnected remote processor to update its representation of the local system.

[0235] 14. Begin Grooming

[0236] This mechanism makes it possible to select a set of files for removal and determine the overall amount of space to be recovered. With reference to FIG. 23, first verify that the global grooming lock is currently unlocked (Step S382). Then set the global grooming lock, set the total amount of space freed during grooming to zero and empty the list of files selected for deletion (Step S384). For each True File in the True File registry 126, set the delete count to zero (Step S386).

[0237] 15. Select for Removal

[0238] This grooming mechanism tentatively selects a pathname to allow its corresponding True File to be removed. With reference to FIG. 24, first find the local directory extensions table entry record 138 corresponding to the given pathname (Step S388). Then find the True File registry entry record 140 corresponding to the True File name in the local directory extensions table entry record 138 (Step S390). Add one to the grooming delete count in the True File registry entry record 140 and add the pathname to a list of files selected for deletion (Step S392). If the grooming delete count of the True File registry entry record 140 is equal to the use count of the True File registry entry record 140, and if there are no entries in the dependency list of the True File registry entry record 140, then add the size of the file indicated by the True File ID and/or compressed file ID to the total amount of space freed during grooming (Step S394).

[0239] 16. End Grooming

[0240] This grooming mechanism ends the grooming phase and removes all files selected for removal. With reference to FIG. 25, for each file in the list of files selected for deletion, delete the file (Step S396) and then unlock the global grooming lock (Step S398).
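The three grooming mechanisms (Begin Grooming, Select For Removal, End Grooming) can be illustrated together. In this Python sketch the class name, the dictionary-based records and the `delete_file` callback are hypothetical; `paths` maps a pathname to a True Name, and each registry record carries `use_count`, `dependents`, `size` and `delete_count` fields. Space is counted as freed only once every path referring to a True File has been selected.

```python
class Groomer:
    def __init__(self, registry, paths):
        self.registry = registry      # True Name -> record (stand-in for records 140)
        self.paths = paths            # pathname -> True Name (stand-in for table 124)
        self.locked = False           # stand-in for the grooming lock 158
        self.freed = 0
        self.selected = []

    def begin(self):
        if self.locked:                                     # Step S382
            raise RuntimeError("grooming already in progress")
        self.locked, self.freed, self.selected = True, 0, []     # Step S384
        for record in self.registry.values():               # Step S386
            record["delete_count"] = 0

    def select(self, path):
        record = self.registry[self.paths[path]]            # Steps S388, S390
        record["delete_count"] += 1                         # Step S392
        self.selected.append(path)
        if record["delete_count"] == record["use_count"] and not record["dependents"]:
            self.freed += record["size"]                    # Step S394

    def end(self, delete_file):
        for path in self.selected:                          # Step S396
            delete_file(path)
        self.locked = False                                 # Step S398
```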

[0241] Operating System Mechanisms

[0242] The next of the mechanisms provided by the present invention, operating system mechanisms, are now described.

[0243] The following operating system mechanisms are described:

[0244] 1. Open File;

[0245] 2. Close File;

[0246] 3. Read File;

[0247] 4. Write File;

[0248] 5. Delete File or Directory;

[0249] 6. Copy File or Directory;

[0250] 7. Move File or Directory;

[0251] 8. Get File Status; and

[0252] 9. Get Files in Directory.

[0253] 1. Open File

[0254] A mechanism to open a file is described with reference to FIG. 26. This mechanism is given as input a pathname and the type of access required for the file (for example, read, write, read/write, create, etc.) and produces either the File ID of the file to be opened or an indication that no file should be opened. The local directory extensions table record 138 and region table record 142 associated with the opened file are associated with the open file for later use in other processing functions which refer to the file, such as read, write, and close.

[0255] First, determine whether or not the named file exists locally by examining the local directory extensions table 124 to determine whether there is an entry corresponding to the given pathname (Step S400). If it is determined that the file name does not exist locally, then, using the access type, determine whether or not the file is being created by this opening process (Step S402). If the file is not being created, prohibit the open (Step S404). If the file is being created, create a zero-length scratch file using an entry in local directory extensions table 124 and produce the scratch file ID of this scratch file as the result (Step S406).

[0256] If, on the other hand, it is determined in step S400 that the file name does exist locally, then determine the region in which the file is located by searching the region table 128 to find the record 142 with the longest region path which is a prefix of the file pathname (Step S408). This record identifies the region of the specified file.
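
The region lookup of Step S408 can be illustrated with a minimal sketch. The function name find_region and the representation of the region table as a plain list of region paths are assumptions made for illustration only. The sketch also assumes that a prefix must end on a path component boundary, which the text does not state explicitly.

```python
from typing import List, Optional

def find_region(region_paths: List[str], pathname: str) -> Optional[str]:
    """Return the longest region path that is a prefix of pathname, or None."""
    best: Optional[str] = None
    for region in region_paths:
        base = region.rstrip("/")
        # Assumption: match whole path components, so that "/usr/lo"
        # is not treated as a prefix of "/usr/local/x".
        if pathname == base or pathname.startswith(base + "/"):
            if best is None or len(region) > len(best):
                best = region
    return best

regions = ["/", "/usr", "/usr/local"]
print(find_region(regions, "/usr/local/bin/tool"))
```

When several regions are nested, the most specific one is chosen, so a read-only or cached subdirectory can carry different region properties than its parent.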

[0257] Next, determine, using the access type, whether the file is being opened for writing or whether it is being opened only for reading (Step S410). If the file is being opened for reading only, then, if the file is a scratch file (Step S419), return the scratch File ID of the file (Step S424). Otherwise get the True Name from the local directory extensions table 124 and make a local version of the True File associated with the True Name using the Make True File Local primitive mechanism, and then return the True File ID associated with the True Name (Step S420).

[0258] If the file is not being opened for reading only (Step S410), then, if it is determined by inspecting the region table entry record 142 that the file is in a read-only directory (Step S416), then prohibit the opening (Step S422).

[0259] If it is determined by inspecting the region table 128 that the file is in a cached region (Step S423), then send a Lock Cache message to the corresponding cache server, and wait for a return message (Step S418). If the return message says the file is already locked, prohibit the opening.

[0260] If the access type indicates that the file being modified is being rewritten completely (Step S419), so that the original data will not be required, then delete the file using the Delete File OS mechanism (Step S421) and perform step S406. Otherwise, make a scratch copy of the file (Step S417) and produce the scratch file ID of the scratch file as the result (Step S424).

[0261] 2. Close File

[0262] This mechanism takes as input the local directory extensions table entry record 138 of an open file and the data maintained for the open file. To close a file, add an entry to the audit file indicating the time and operation (create, read or write). The audit file processing (using the Process Audit File Entry primitive mechanism) will take care of assimilating the file and thereby updating the other records.

[0263] 3. Read File

[0264] To read a file, a program must provide the offset and length of the data to be read, and the location of a buffer into which to copy the data read.

[0265] The file to be read from is identified by an open file descriptor which includes a File ID as computed by the Open File operating system mechanism defined above. The File ID may identify either a scratch file or a True File (or True File segment). If the File ID identifies a True File, it may be either a simple or a compound True File. Reading a file is accomplished by the following steps:

[0266] In the case where the File ID identifies a scratch file or a simple True File, use the read capabilities of the underlying operating system.

[0267] In the case where the File ID identifies a compound file, break the read operation into one or more read operations on component segments as follows:

[0268] A. Identify the segment(s) to be read by dividing the specified file offset and length each by the fixed size of a segment (a system dependent parameter), to determine the segment number and number of segments that must be read.

[0269] B. For each segment number computed above, do the following:

[0270] i. Read the compound True File index block to determine the True Name of the segment to be read.

[0271] ii. Use the Realize True File from Location primitive mechanism to make the True File segment available locally. (If that mechanism fails, the Read File mechanism fails).

[0272] iii. Determine the File ID of the True File specified by the True Name corresponding to this segment.

[0273] iv. Use the Read File mechanism (recursively) to read from this segment into the corresponding location in the specified buffer.
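
The arithmetic of step A can be sketched as follows. This is an illustration only: the function name split_read and the tuple layout are hypothetical, and the index block lookup and the Realize True File from Location mechanism of step B are not modeled. The sketch shows how a read of a compound file might be broken into per-segment reads, assuming fixed-size segments numbered from zero.

```python
from typing import Iterator, Tuple

def split_read(offset: int, length: int,
               segment_size: int) -> Iterator[Tuple[int, int, int, int]]:
    """Yield (segment_number, offset_in_segment, byte_count, buffer_offset)
    for each component read needed to satisfy a read of a compound file."""
    end = offset + length
    pos = offset
    while pos < end:
        segment = pos // segment_size          # which segment holds this byte
        seg_offset = pos % segment_size        # where the read starts inside it
        count = min(segment_size - seg_offset, end - pos)
        yield segment, seg_offset, count, pos - offset
        pos += count

# A 1000-byte read at offset 1500 with 1024-byte segments spans two segments.
print(list(split_read(1500, 1000, 1024)))
# -> [(1, 476, 548, 0), (2, 0, 452, 548)]
```

Each tuple corresponds to one recursive Read File call on a segment, with the last element giving the location in the caller's buffer that receives the data.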

[0274] 4. Write File

[0275] File writing uses the file ID and data management capabilities of the underlying operating system. File access (Make File Local described above) can be deferred until the first read or write.

[0276] 5. Delete File or Directory

[0277] The process of deleting a file, for a given pathname, is described here with reference to FIG. 27.

[0278] First, determine the local directory extensions table entry record 138 and region table entry record 142 for the file (Step S422). If the file has no local directory extensions table entry record 138 or is locked or is in a read-only region, prohibit the deletion.

[0279] Identify the corresponding True File given the True Name of the file being deleted using the True File registry 126 (Step S424). If the file has no True Name (Step S426), then delete the scratch copy of the file based on its scratch file ID in the local directory extensions table 124 (Step S427), and continue with step S428.

[0280] If the file has a True Name and the True File's use count is one (Step S429), then delete the True File (Step S430), and continue with step S428.

[0281] If the file has a True Name and the True File's use count is greater than one, reduce its use count by one (Step S431). Then proceed with step S428.

[0282] In Step S428, delete the local directory extensions table entry record, and add an entry to the audit file 132 indicating the time and the operation performed (delete).
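
The use count bookkeeping of the Delete File mechanism can be sketched as follows. The names DirEntry and delete_file, the use of in-memory dictionaries for the local directory extensions table and the True File registry, and the tuple format of audit entries are all assumptions made for illustration. Removal of the underlying scratch or True File data from storage is not modeled.

```python
import time
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

@dataclass
class DirEntry:
    """Hypothetical stand-in for a local directory extensions table entry record."""
    true_name: Optional[str] = None    # None means the file is still a scratch file
    locked: bool = False
    read_only_region: bool = False

def delete_file(path: str,
                directory: Dict[str, DirEntry],
                use_counts: Dict[str, int],
                audit: List[Tuple[float, str, str]]) -> bool:
    entry = directory.get(path)
    if entry is None or entry.locked or entry.read_only_region:
        return False                               # deletion is prohibited
    if entry.true_name is not None:
        if use_counts[entry.true_name] == 1:
            del use_counts[entry.true_name]        # last reference: delete the True File
        else:
            use_counts[entry.true_name] -= 1       # other pathnames still refer to it
    # (Deleting a scratch copy from storage is omitted in this sketch.)
    del directory[path]                            # Step S428: remove the entry record
    audit.append((time.time(), "delete", path))    # and log the operation
    return True

directory = {"/a": DirEntry("T1"), "/b": DirEntry("T1")}
use_counts = {"T1": 2}
audit: List[Tuple[float, str, str]] = []
delete_file("/a", directory, use_counts, audit)
print(use_counts)   # the shared True File survives with a use count of 1
```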

[0283] 6. Copy File or Directory

[0284] A mechanism is provided to copy a file or directory given a source and destination processor and pathname. The Copy File mechanism does not actually copy the data in the file, only the True Name of the file. This mechanism is performed as follows:

[0285] (A) Given the source path, get the True Name from the path. If this step fails, the mechanism fails.

[0286] (B) Given the True Name and the destination path, link the destination path to the True Name.

[0287] (C) If the source and destination processors have different True File registries, find (or, if necessary, create) an entry for the True Name in the True File registry table 126 of the destination processor. Enter into the source ID field of this new entry the source processor identity.

[0288] (D) Add an entry to the audit file 132 indicating the time and operation performed (copy).

[0289] This mechanism addresses the capability of the system to avoid copying data from a source location to a destination location when the destination already has the data. In addition, because of the ability to freeze a directory, this mechanism also addresses the capability of the system immediately to make a copy of any collection of files, thereby supporting an efficient version control mechanism for groups of files.
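
A minimal single-processor sketch of steps (A), (B) and (D) follows. The function name copy_file and the dictionary-based tables are hypothetical. The sketch assumes that linking a path to a True Name increments the True File's use count, and that the destination path is not already linked. Step (C), which concerns separate registries, is not modeled.

```python
import time
from typing import Dict, List, Tuple

def copy_file(src: str, dst: str,
              links: Dict[str, str],          # pathname -> True Name
              use_counts: Dict[str, int],     # True Name -> use count
              audit: List[Tuple[float, str, str, str]]) -> bool:
    true_name = links.get(src)                # (A) get the True Name from the path
    if true_name is None:
        return False                          # the mechanism fails
    links[dst] = true_name                    # (B) link destination to the True Name
    use_counts[true_name] = use_counts.get(true_name, 0) + 1   # assumed side effect
    audit.append((time.time(), "copy", src, dst))              # (D) audit entry
    return True

links = {"/src/report": "T1"}
use_counts = {"T1": 1}
audit: List[Tuple[float, str, str, str]] = []
copy_file("/src/report", "/dst/report", links, use_counts, audit)
print(links["/dst/report"], use_counts["T1"])
```

No file data is touched at any point; the cost of the copy is independent of the size of the file.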

[0290] 7. Move File or Directory

[0291] A mechanism is described which moves (or renames) a file from a source path to a destination path. The move operation, like the copy operation, requires no actual transfer of data, and is performed as follows:

[0292] (A) Copy the file from the source path to the destination path.

[0293] (B) If the source path is different from the destination path, delete the source path.

[0294] 8. Get File Status

[0295] This mechanism takes a file pathname and provides information about the pathname. First the local directory extensions table entry record 138 corresponding to the pathname given is found. If no such entry exists, then this mechanism fails; otherwise, gather information about the file and its corresponding True File from the local directory extensions table 124. The information can include any information shown in the data structures, including the size, type, owner, True Name, sources, time of last access, time of last modification, state (local or not, assimilated or not, compressed or not), use count, expiration date, and reservations.

[0296] 9. Get Files in Directory

[0297] This mechanism enumerates the files in a directory. It is used (implicitly) whenever it is necessary to determine whether a file exists (is present) in a directory. For instance, it is implicitly used in the Open File, Delete File, Copy File or Directory, and Move File operating system mechanisms, because the files operated on are referred to by pathnames containing directory names. The mechanism works as follows:

[0298] The local directory extensions table 124 is searched for an entry 138 with the given directory pathname. If no such entry is found, or if the entry found is not a directory, then this mechanism fails.

[0299] If there is a corresponding True File field in the local directory extensions table record, then it is assumed that the True File represents a frozen directory. The Expand Frozen Directory primitive mechanism is used to expand the existing True File into directory entries in the local directory extensions table.

[0300] Finally, the local directory extensions table 124 is again searched, this time to find each directory subordinate to the given directory. The names found are provided as the result.

[0301] Remote Mechanisms

[0302] The remote mechanisms provided by the present invention are now described. Recall that remote mechanisms are used by the operating system in responding to requests from other processors. These mechanisms enable the capabilities of the present invention in a peer-to-peer network mode of operation.

[0303] In a presently preferred embodiment, processors communicate with each other using a remote procedure call (RPC) style interface, running over one of any number of communication protocols such as IPX/SPX or TCP/IP. Each peer processor which provides access to its True File registry 126 or file regions, or which depends on another peer processor, provides a number of mechanisms which can be used by its peers.

[0304] The following remote mechanisms are described:

[0305] 1. Locate True File;

[0306] 2. Reserve True File;

[0307] 3. Request True File;

[0308] 4. Retire True File;

[0309] 5. Cancel Reservation;

[0310] 6. Acquire True File;

[0311] 7. Lock Cache;

[0312] 8. Update Cache; and

[0313] 9. Check Expiration Date.

[0314] 1. Locate True File

[0315] This mechanism allows a remote processor to determine whether the local processor contains a copy of a specific True File. The mechanism begins with a True Name and a flag indicating whether to forward requests for this file to other servers. This mechanism is now described with reference to FIG. 28.

[0316] First determine if the True File is available locally or if there is some indication of where the True File is located (for example, in the Source IDs field). Look up the requested True Name in the True File registry 126 (Step S432).

[0317] If a True File registry entry record 140 is not found for this True Name (Step S434), and the flag indicates that the request is not to be forwarded (Step S436), respond negatively (Step S438). That is, respond to the effect that the True File is not available.

[0318] On the other hand, if a True File registry entry record 140 is not found (Step S434), and the flag indicates that the request for this True File is to be forwarded (Step S436), then forward a request for this True File to some other processors in the system (Step S442). If the source table for the current processor identifies one or more publishing servers which should have a copy of this True File, then forward the request to each of those publishing servers (Step S436).

[0319] If a True File registry entry record 140 is found for the required True File (Step S434), and if the entry includes a True File ID or Compressed File ID (Step S440), respond positively (Step S444). If the entry includes a True File ID then this provides the identity or disk location of the actual physical representation of the file or file segment required. If the entry includes a Compressed File ID, then a compressed version of the True File may be stored instead of, or in addition to, an uncompressed version. This field provides the identity of the actual representation of the compressed version of the file.

[0320] If the True File registry entry record 140 is found (Step S434) but does not include a True File ID (the File ID is absent if the actual file is not currently present at the current location) (Step S440), and if the True File registry entry record 140 includes one or more source processors, and if the request can be forwarded, then forward the request for this True File to one or more of the source processors (Step S444).
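
The decision logic of the Locate True File mechanism can be summarized in a short sketch. The names RegistryEntry and locate_true_file and the three string results are hypothetical. Two simplifications are assumptions of this sketch rather than statements of the text: when no entry exists, the request is forwarded only to known publishing servers, and when an entry exists without a file ID and the request cannot be forwarded, the reply is negative.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class RegistryEntry:
    """Hypothetical stand-in for a True File registry entry record."""
    true_file_id: Optional[str] = None
    compressed_file_id: Optional[str] = None
    source_ids: List[str] = field(default_factory=list)

def locate_true_file(registry: Dict[str, RegistryEntry], true_name: str,
                     forward: bool, publishers: List[str]) -> Tuple[str, List[str]]:
    """Return ("found", []), ("not_found", []) or ("forward", targets)."""
    entry = registry.get(true_name)
    if entry is None:
        if forward and publishers:
            return ("forward", list(publishers))   # ask the publishing servers
        return ("not_found", [])                   # respond negatively
    if entry.true_file_id or entry.compressed_file_id:
        return ("found", [])                       # a local representation exists
    if forward and entry.source_ids:
        return ("forward", list(entry.source_ids)) # ask the recorded sources
    return ("not_found", [])                       # assumed fallback

registry = {"T1": RegistryEntry(true_file_id="f1"),
            "T2": RegistryEntry(source_ids=["peerA"])}
print(locate_true_file(registry, "T2", True, []))
```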

[0321] 2. Reserve True File

[0322] This mechanism allows a remote processor to indicate that it depends on the local processor for access to a specific True File. It takes a True Name as input. This mechanism is described here.

[0323] (A) Find the True File registry entry record 140 associated with the given True File. If no entry exists, reply negatively.

[0324] (B) If the True File registry entry record 140 does not include a True File ID or compressed File ID, and if the True File registry entry record 140 includes no source IDs for removable storage volumes, then this processor does not have access to a copy of the given file. Reply negatively.

[0325] (C) Add the ID of the sending processor to the list of dependent processors for the True File registry entry record 140. Reply positively, with an indication of whether the reserved True File is on line or off line.

[0326] 3. Request True File

[0327] This mechanism allows a remote processor to request a copy of a True File from the local processor. It requires a True Name and responds positively by sending a True File back to the requesting processor. The mechanism operates as follows:

[0328] (A) Find the True File registry entry record 140 associated with the given True Name. If there is no such True File registry entry record 140, reply negatively.

[0329] (B) Make the True File local using the Make True File Local primitive mechanism. If this mechanism fails, the Request True File mechanism also fails.

[0330] (C) Send the local True File in either its uncompressed or compressed form to the requesting remote processor. Note that if the True File is a compound file, the components are not sent.

[0331] (D) If the remote processor is listed in the dependent processor list of the True File registry entry record 140, remove it.

[0332] 4. Retire True File

[0333] This mechanism allows a remote processor to indicate that it no longer plans to maintain a copy of a given True File. An alternate source of the True File can be specified, if, for instance, the True File is being moved from one server to another. It begins with a True Name, a requesting processor ID, and an optional alternate source. This mechanism operates as follows:

[0334] (A) Find a True Name entry in the True File registry 126. If there is no entry for this True Name, this mechanism's task is complete.

[0335] (B) Find the requesting processor on the source list and, if it is there, remove it.

[0336] (C) If an alternate source is provided, add it to the source list for the True File registry entry record 140.

[0337] (D) If the source list of the True File registry entry record 140 has no items in it, use the Locate Remote File primitive mechanism to search for another copy of the file. If it fails, raise a serious error.

[0338] 5. Cancel Reservation

[0339] This mechanism allows a remote processor to indicate that it no longer requires access to a True File stored on the local processor. It begins with a True Name and a requesting processor ID and proceeds as follows:

[0340] (A) Find the True Name entry in the True File registry 126. If there is no entry for this True Name, this mechanism's task is complete.

[0341] (B) Remove the identity of the requesting processor from the list of dependent processors, if it appears.

[0342] (C) If the list of dependent processors becomes empty and the use count is also zero, delete the True File.
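
The dependency bookkeeping shared by the Reserve True File and Cancel Reservation mechanisms can be sketched together. The names Entry, reserve and cancel_reservation are hypothetical, as is the choice to report "online" when a local copy exists and "offline" when only removable volumes hold the file. Deleting the registry entry stands in here for deleting the True File itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

@dataclass
class Entry:
    """Hypothetical stand-in for a True File registry entry record."""
    has_local_copy: bool = False                  # True File ID or compressed File ID present
    removable_sources: List[str] = field(default_factory=list)
    dependents: Set[str] = field(default_factory=set)
    use_count: int = 0

def reserve(registry: Dict[str, Entry], true_name: str,
            requester: str) -> Optional[str]:
    """Return "online" or "offline" on success, None for a negative reply."""
    entry = registry.get(true_name)
    if entry is None:
        return None                               # (A) no entry
    if not entry.has_local_copy and not entry.removable_sources:
        return None                               # (B) no access to any copy
    entry.dependents.add(requester)               # (C) record the dependency
    return "online" if entry.has_local_copy else "offline"

def cancel_reservation(registry: Dict[str, Entry], true_name: str,
                       requester: str) -> None:
    entry = registry.get(true_name)
    if entry is None:
        return                                    # nothing to do
    entry.dependents.discard(requester)           # remove if it appears
    if not entry.dependents and entry.use_count == 0:
        del registry[true_name]                   # stand-in for deleting the True File

registry = {"T1": Entry(has_local_copy=True)}
print(reserve(registry, "T1", "peerA"))
cancel_reservation(registry, "T1", "peerA")
print("T1" in registry)
```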

[0343] 6. Acquire True File

[0344] This mechanism allows a remote processor to insist that a local processor make a copy of a specified True File. It is used, for example, when a cache client wants to write through a new version of a file. The Acquire True File mechanism begins with a data item and an optional True Name for the data item and proceeds as follows:

[0345] (A) Confirm that the requesting processor has the right to require the local processor to acquire data items. If not, send a negative reply.

[0346] (B) Make a local copy of the data item transmitted by the remote processor.

[0347] (C) Assimilate the data item into the True File registry of the local processor.

[0348] (D) If a True Name was provided with the file, the True Name calculation can be avoided, or the mechanism can verify that the file received matches the True Name sent.

[0349] (E) Add an entry in the dependent processor list of the True File registry record indicating that the requesting processor depends on this copy of the given True File.

[0350] (F) Send a positive reply.
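
The steps above, including the optional verification of step (D), can be sketched as follows. The names true_name_of and acquire_true_file are hypothetical. SHA-256 is used here only as an assumed stand-in for whatever function the system uses to compute True Names, and returning a negative reply on a mismatch is likewise an assumption of this sketch.

```python
import hashlib
from typing import Dict, Optional, Set

def true_name_of(data: bytes) -> str:
    # Assumption: SHA-256 stands in for the system's True Name function.
    return hashlib.sha256(data).hexdigest()

def acquire_true_file(registry: Dict[str, bytes],
                      dependents: Dict[str, Set[str]],
                      authorized: Set[str],
                      requester: str,
                      data: bytes,
                      claimed_true_name: Optional[str] = None) -> Optional[str]:
    """Return the True Name on a positive reply, None on a negative reply."""
    if requester not in authorized:
        return None                                    # (A) no right to require this
    name = true_name_of(data)                          # (B), (C) copy and assimilate
    if claimed_true_name is not None and claimed_true_name != name:
        return None                                    # (D) verification failed (assumed reply)
    registry.setdefault(name, data)                    # duplicate data is stored only once
    dependents.setdefault(name, set()).add(requester)  # (E) record the dependency
    return name                                        # (F) positive reply

registry: Dict[str, bytes] = {}
dependents: Dict[str, Set[str]] = {}
name = acquire_true_file(registry, dependents, {"client1"}, "client1", b"hello")
print(name is not None, dependents[name])
```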

[0351] 7. Lock Cache

[0352] This mechanism allows a remote cache client to lock a local file so that local users or other cache clients cannot change it while the remote processor is using it. The mechanism begins with a pathname and proceeds as follows:

[0353] (A) Find the local directory extensions table entry record 138 of the specified pathname. If no such entry exists, reply negatively.

[0354] (B) If a local directory extensions table entry record 138 exists and is already locked, reply negatively that the file is already locked.

[0355] (C) If a local directory extensions table entry record 138 exists and is not locked, lock the entry. Reply positively.

[0356] 8. Update Cache

[0357] This mechanism allows a remote cache client to unlock a local file and update it with new contents. It begins with a pathname and a True Name. The file corresponding to the True Name must be accessible from the remote processor. This mechanism operates as follows:

[0358] Find the local directory extensions table entry record 138 corresponding to the given pathname. Reply negatively if no such entry exists or if the entry is not locked.

[0359] Link the given pathname to the given True Name using the Link Path to True Name primitive mechanism.

[0360] Unlock the local directory extensions table entry record 138 and return positively.
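
The Lock Cache and Update Cache mechanisms form a simple lock, write through, unlock protocol, which can be sketched as follows. The function names, the dictionary-based entry records, and the reply strings are assumptions made for illustration.

```python
from typing import Dict

Entry = Dict[str, object]   # hypothetical record: {"true_name": str, "locked": bool}

def lock_cache(table: Dict[str, Entry], path: str) -> str:
    entry = table.get(path)
    if entry is None:
        return "no_such_file"          # (A) reply negatively
    if entry["locked"]:
        return "already_locked"        # (B) reply negatively: already locked
    entry["locked"] = True             # (C) lock the entry
    return "ok"

def update_cache(table: Dict[str, Entry], path: str, true_name: str) -> bool:
    entry = table.get(path)
    if entry is None or not entry["locked"]:
        return False                   # only a locked entry may be updated
    entry["true_name"] = true_name     # link the pathname to the new True Name
    entry["locked"] = False            # unlock and reply positively
    return True

table: Dict[str, Entry] = {"/doc": {"true_name": "T1", "locked": False}}
print(lock_cache(table, "/doc"), lock_cache(table, "/doc"))
print(update_cache(table, "/doc", "T2"), table["/doc"])
```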

[0361] 9. Check Expiration Date

[0362] Return the current or new expiration date and a possible alternative source to the caller.

[0363] Background Processes and Mechanisms

[0364] The background processes and mechanisms provided by the present invention are now described. Recall that background mechanisms are intended to run occasionally and at a low priority to provide automated management capabilities with respect to the present invention.

[0365] The following background mechanisms are described:

[0366] 1. Mirror True File;

[0367] 2. Groom Region;

[0368] 3. Check for Expired Links;

[0369] 4. Verify Region; and

[0370] 5. Groom Source List.

[0371] 1. Mirror True File

[0372] This mechanism is used to ensure that files are available in alternate locations in mirror groups or archived on archival servers. The mechanism depends on application-specific migration/archival criteria (size, time since last access, number of copies required, number of existing alternative sources) which determine under what conditions a file should be moved. The Mirror True File mechanism operates as follows. Using the True File specified, perform the following steps:

[0373] (A) Count the number of available locations of the True File by inspecting the source list of the True File registry entry record 140 for the True File. This step determines how many copies of the True File are available in the system.

[0374] (B) If the True File meets the specified migration criteria, select a mirror group server to which a copy of the file should be sent. Use the Acquire True File remote mechanism to copy the True File to the selected mirror group server. Add the identity of the selected system to the source list for the True File.

[0375] 2. Groom Region

[0376] This mechanism is used to automatically free up space in a processor by deleting data items that may be available elsewhere. The mechanism depends on application-specific grooming criteria (for instance, a file may be removed if there is an alternate online source for it, it has not been accessed in a given number of days, and it is larger than a given size). This mechanism operates as follows:

[0377] Repeat the following steps (i) to (iii) with more aggressive grooming criteria until sufficient space is freed or until all grooming criteria have been exercised. Use grooming information to determine how much space has been freed. Recall that, while grooming is in effect, grooming information includes a table of pathnames selected for deletion, and keeps track of the amount of space that would be freed if all of the files were deleted.

[0378] (i) Begin Grooming (using the primitive mechanism).

[0379] (ii) For each pathname in the specified region, for the True File corresponding to the pathname, if the True File is present, has at least one alternative source, and meets application specific grooming criteria for the region, select the file for removal (using the primitive mechanism).

[0380] (iii) End Grooming (using the primitive mechanism).

[0381] If the region is used as a cache, no other processors are dependent on True Files to which it refers, and all such True Files are mirrored elsewhere. In this case, True Files can be removed with impunity. For a cache region, the grooming criteria would ordinarily eliminate the least recently accessed True Files first. This is best done by sorting the True Files in the region by the most recent access time before performing step (ii) above. The application specific criteria would thus be to select for removal every True File encountered (beginning with the least recently used) until the required amount of free space is reached.
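
The least recently used selection described for a cache region can be sketched as follows. The function name select_for_grooming and the tuple layout of the candidate list are hypothetical. The sketch only decides which pathnames to select; the Begin Grooming, Select For Removal and End Grooming primitive mechanisms, and the grooming delete counts they maintain, are not modeled.

```python
from typing import List, Tuple

# Each candidate: (pathname, size_in_bytes, last_access_time, alternative_source_count)
Candidate = Tuple[str, int, float, int]

def select_for_grooming(files: List[Candidate],
                        bytes_needed: int) -> Tuple[List[str], int]:
    """Select least recently accessed files that have at least one alternative
    source until bytes_needed would be freed. Returns (pathnames, bytes_freed)."""
    eligible = [f for f in files if f[3] >= 1]
    eligible.sort(key=lambda f: f[2])            # oldest access time first
    selected: List[str] = []
    freed = 0
    for path, size, _last_access, _sources in eligible:
        if freed >= bytes_needed:
            break
        selected.append(path)
        freed += size
    return selected, freed

files = [("a", 100, 30.0, 1), ("b", 200, 10.0, 1),
         ("c", 300, 20.0, 0), ("d", 50, 5.0, 2)]
print(select_for_grooming(files, 220))
# -> (['d', 'b'], 250)
```

File "c" is never selected because it has no alternative source, even though it is older than "a".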

[0382] 3. Check for Expired Links

[0383] This mechanism is used to determine whether dependencies on published files should be refreshed. The following steps describe the operation of this mechanism:

[0384] For each pathname in the specified region, for each True File corresponding to the pathname, perform the following step:

[0385] If the True File registry entry record 140 corresponding to the True File contains at least one source which is a publishing server, and if the expiration date on the dependency is past or close, then perform the following steps:

[0386] (A) Determine whether the True File registry entry record contains other sources which have not expired.

[0387] (B) Check the True Name expiration of the server. If the expiration date has been extended, or an alternate source is suggested, add the source to the True File registry entry record 140.

[0388] (C) If no acceptable alternate source was found in steps (A) or (B) above, make a local copy of the True File.

[0389] (D) Remove the expired source.

[0390] 4. Verify Region

[0391] This mechanism can be used to ensure that the data items in the True File registry 126 have not been damaged accidentally or maliciously. The operation of this mechanism is described by the following steps:

[0392] (A) Search the local directory extensions table 124 for each pathname in the specified region and then perform the following steps:

[0393] (i) Get the True File name corresponding to the pathname;

[0394] (ii) If the True File registry entry 140 for the True File does not have a True File ID or compressed file ID, ignore it.

[0395] (iii) Use the Verify True File mechanism (see extended mechanisms below) to confirm that the True File specified is correct.

[0396] 5. Groom Source List

[0397] The source list in a True File entry should be groomed sometimes to ensure there are not too many mirror or archive copies. When a file is deleted or when a region definition or its mirror criteria are changed, it may be necessary to inspect the affected True Files to determine whether there are too many mirror copies. This can be done with the following steps:

[0398] For each affected True File,

[0399] (A) Search the local directory extensions table to find each region that refers to the True File.

[0400] (B) Create a set of “required sources”, initially empty.

[0401] (C) For each region found,

[0402] (a) determine the mirroring criteria for that region,

[0403] (b) determine which sources for the True File satisfy the mirroring criteria, and

[0404] (c) add these sources to the set of required sources.

[0405] (D) For each source in the True File registry entry, if the source identifies a remote processor (as opposed to removable media), and if the source is not a publisher, and if the source is not in the set of required sources, then eliminate the source, and use the Cancel Reservation remote mechanism to eliminate the given processor from the list of dependent processors recorded at the remote processor identified by the source.
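
Steps (B) through (D) amount to a set computation, sketched below for a single True File. The function name groom_sources, the dictionary describing each source, and the satisfies callback (which stands in for a region's mirroring criteria) are hypothetical. Sending the Cancel Reservation message to each eliminated source is left to the caller.

```python
from typing import Callable, Dict, List, Set, Tuple

def groom_sources(sources: Dict[str, Dict[str, bool]],
                  regions: List[str],
                  satisfies: Callable[[str, str], bool]) -> Tuple[List[str], List[str]]:
    """sources maps a source ID to {"remote": bool, "publisher": bool}.
    Returns (kept_source_ids, eliminated_source_ids)."""
    required: Set[str] = set()                    # (B) initially empty
    for region in regions:                        # (C) collect required sources
        required |= {s for s in sources if satisfies(region, s)}
    eliminated = [s for s, info in sources.items()   # (D) drop surplus remote copies
                  if info["remote"] and not info["publisher"] and s not in required]
    kept = [s for s in sources if s not in eliminated]
    return kept, eliminated

sources = {
    "srvA": {"remote": True, "publisher": False},
    "srvB": {"remote": True, "publisher": False},
    "pub1": {"remote": True, "publisher": True},
    "cd1": {"remote": False, "publisher": False},
}
print(groom_sources(sources, ["r1"], lambda region, s: s == "srvA"))
# -> (['srvA', 'pub1', 'cd1'], ['srvB'])
```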

[0406] Extended Mechanisms

[0407] The extended mechanisms provided by the present invention are now described. Recall that extended mechanisms run within application programs over the operating system to provide solutions to specific problems and applications.

[0408] The following extended mechanisms are described:

[0409] 1. Inventory Existing Directory;

[0410] 2. Inventory Removable, Read-only Files;

[0411] 3. Synchronize Directories;

[0412] 4. Publish Region;

[0413] 5. Retire Directory;

[0414] 6. Realize Directory at Location;

[0415] 7. Verify True File;

[0416] 8. Track for Accounting Purposes; and

[0417] 9. Track for Licensing Purposes.

[0418] 1. Inventory Existing Directory

[0419] This mechanism determines the True Names of files in an existing on-line directory in the underlying operating system. One purpose of this mechanism is to install True Name mechanisms in an existing file system.

[0420] An effect of such an installation is to eliminate immediately all duplicate files from the file system being traversed. If several file systems are inventoried in a single True File registry, duplicates across the volumes are also eliminated.

[0421] (A) Traverse the underlying file system in the operating system. For each file encountered, excluding directories, perform the following:

[0422] (i) Assimilate the file encountered (using the Assimilate File primitive mechanism). This process computes its True Name and moves its data into the True File registry 126.

[0423] (ii) Create a pathname consisting of the path to the volume directory and the relative path of the file on the media. Link this path to the computed True Name using the Link Path to True Name primitive mechanism.
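
The traversal can be sketched with the standard library. The function name inventory_directory is hypothetical, SHA-256 is an assumed stand-in for the True Name function, and the registry here merely records the first path seen for each True Name instead of moving file data. The sketch reads each file fully into memory, which a practical implementation would likely avoid.

```python
import hashlib
import os
from typing import Dict, Tuple

def inventory_directory(root: str) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Walk root and return (registry, links):
    registry maps True Name -> first path found holding that data,
    links maps relative pathname -> True Name."""
    registry: Dict[str, str] = {}
    links: Dict[str, str] = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:                 # directories are excluded
            full = os.path.join(dirpath, filename)
            with open(full, "rb") as fh:
                # Assumption: SHA-256 stands in for the True Name function.
                name = hashlib.sha256(fh.read()).hexdigest()
            registry.setdefault(name, full)        # duplicates collapse to one entry
            links[os.path.relpath(full, root)] = name
    return registry, links
```

Because files with identical contents yield the same True Name, the registry ends up with one entry per distinct data item, while every pathname remains linked.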

[0424] 2. Inventory Removable, Read-Only Files

[0425] A system with access to removable, read-only media volumes (such as WORM disks and CD-ROMs) can create a usable inventory of the files on these disks without having to make online copies. These objects can then be used for archival purposes, directory overlays, or other needs. An operator must request that an inventory be created for such a volume.

[0426] This mechanism allows for maintaining inventories of the contents of files and data items on removable media, such as diskettes and CD-ROMs, independent of other properties of the files such as name, location, and date of creation.

[0427] The mechanism creates an online inventory of the files on one or more removable volumes, such as a floppy disk or CD-ROM, when the data on the volume is represented as a directory. The inventory service uses a True Name to identify each file, providing a way to locate the data independent of its name, date of creation, or location.

[0428] The inventory can be used for archival of data (making it possible to avoid archiving data when that data is already on a separate volume), for grooming (making it possible to delete infrequently accessed files if they can be retrieved from removable volumes), for version control (making it possible to generate a new version of a CD-ROM without having to copy the old version), and for other purposes.

[0429] The inventory is made by creating a volume directory in the media inventory in which each file named identifies the data item on the volume being inventoried. Data items are not copied from the removable volume during the inventory process.

[0430] An operator must request that an inventory be created for a specific volume. Once created, the volume directory can be frozen or copied like any other directory. Data items from either the physical volume or the volume directory can be accessed using the Open File operating system mechanism which will cause them to be read from the physical volume using the Realize True File from Location primitive mechanism.

[0431] To create an inventory the following steps are taken:

[0432] (A) A volume directory in the media inventory is created to correspond to the volume being inventoried. Its contextual name identifies the specific volume.

[0433] (B) A source table entry 144 for the volume is created in the source table 130. This entry 144 identifies the physical source volume and the volume directory created in step (A).

[0434] (C) The file system on the volume is traversed. For each file encountered, excluding directories, the following steps are taken:

[0435] (i) The True Name of the file is computed. An entry is created in the True File registry 126, including the True Name of the file using the primitive mechanism. The source field of the True File registry entry 140 identifies the source table entry 144.

[0436] (ii) A pathname is created consisting of the path to the volume directory and the relative path of the file on the media. This path is linked to the computed True Name using the Link Path to True Name primitive mechanism.

[0437] (D) After all files have been inventoried, the volume directory is frozen. The volume directory serves as a table of contents for the volume. It can be copied using the Copy File or Directory operating system mechanism to create an “overlay” directory which can then be modified, making it possible to edit a virtual copy of a read-only medium.

[0438] 3. Synchronize Directories

[0439] Given two versions of a directory derived from the same starting point, this mechanism creates a new, synchronized version which includes the changes from each. Where a file is changed in both versions, this mechanism provides a user exit for handling the discrepancy. By using True Names, comparisons are instantaneous, and no copies of files are necessary.

[0440] This mechanism lets a local processor synchronize a directory to account for changes made at a remote processor. Its purpose is to bring a local copy of a directory up to date after a period of no communication between the local and remote processor. Such a period might occur if the local processor were a mobile processor detached from its server, or if two distant processors were run independently and updated nightly.

[0441] An advantage of the described synchronization process is that it does not depend on synchronizing the clocks of the local and remote processors. However, it does require that the local processor track its position in the remote processor's audit file.

[0442] This mechanism does not resolve changes made simultaneously to the same file at several sites. If that occurs, an external resolution mechanism such as, for example, operator intervention, is required.

[0443] The mechanism takes as input a start time, a local directory pathname, a remote processor name, and a remote directory pathname, and it operates by the following steps:

[0444] (A) Request a copy of the audit file 132 from the remote processor using the Request True File remote mechanism.

[0445] (B) For each entry 146 in the audit file 132 after the start time, if the entry indicates a change to a file in the remote directory, perform the following steps:

[0446] (i) Compute the pathname of the corresponding file in the local directory. Determine the True Name of the corresponding file.

[0447] (ii) If the True Name of the local file is the same as the old True Name in the audit file, or if there is no local file and the audit entry indicates a new file is being created, link the new True Name in the audit file to the local pathname using the Link Path to True Name primitive mechanism.

[0448] (iii) Otherwise, note that there is a problem with the synchronization by sending a message to the operator or to a problem resolution program, indicating the local pathname, remote pathname, remote processor, and time of change.

[0449] (C) After synchronization is complete, record the time of the final change. This time is to be used as the new start time the next time this directory is synchronized with the same remote processor.
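Steps (A) through (C) above can be illustrated with a minimal sketch. It assumes the audit file has already been fetched and parsed into a list, and that the local directory is a plain mapping from relative pathname to True Name. The names `AuditEntry` and `synchronize` are hypothetical, and file deletions are not modeled.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class AuditEntry:
    """One change recorded in the remote audit file (hypothetical layout)."""
    time: float
    relative_path: str             # path of the file relative to the remote directory
    old_true_name: Optional[str]   # None when the entry records a newly created file
    new_true_name: str


def synchronize(local: Dict[str, str], audit: List[AuditEntry],
                start_time: float) -> Tuple[float, List[str]]:
    """Apply remote changes to `local` (relative path -> True Name).

    Returns the time of the final change processed and the list of paths
    that need external resolution.
    """
    conflicts: List[str] = []
    last_time = start_time
    for entry in sorted(audit, key=lambda e: e.time):
        if entry.time <= start_time:
            continue  # step (B): only entries after the start time
        local_name = local.get(entry.relative_path)  # step (i); None if no local file
        if local_name == entry.old_true_name:
            # Step (ii): the local file is unchanged (or absent for a new file),
            # so link the new True Name to the local pathname.
            local[entry.relative_path] = entry.new_true_name
        else:
            # Step (iii): both sides changed; report instead of overwriting.
            conflicts.append(entry.relative_path)
        last_time = entry.time
    return last_time, conflicts  # step (C): last_time becomes the next start time
```

Only identifiers are compared and relinked here; no file contents are read or transferred during the comparison.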

[0450] 4. Publish Region

[0451] The publish region mechanism allows a processor to offer the files in a region to any client processors for a limited period of time.

[0452] The purpose of the service is to eliminate any need for client processors to make reservations with the publishing processor. This in turn makes it possible for the publishing processor to service a much larger number of clients.

[0453] When a region is published, an expiration date is defined for all files in the region, and is propagated into the publishing system's True File registry entry record 140 for each file.

[0454] When a remote file is copied, for instance using the Copy File operating system mechanism, the expiration date is copied into the source field of the client's True File registry entry record 140. When the source is a publishing system, no dependency need be created.

[0455] The client processor must occasionally, in the background, check for expired links to make sure it still has access to these files. This is described in the background mechanism Check for Expired Links.

[0456] 5. Retire Directory

[0457] This mechanism makes it possible to eliminate safely the True Files in a directory, or at least dependencies on them, after ensuring that any client processors depending on those files remove their dependencies. The files in the directory are not actually deleted by this process. The directory can be deleted with the Delete File operating system mechanism.

[0458] The mechanism takes the pathname of a given directory, and optionally, the identification of a preferred alternate source processor for clients to use. The mechanism performs the following steps:

[0459] (A) Traverse the directory. For each file in the directory, perform the following steps:

[0460] (i) Get the True Name of the file from its path and find the True File registry entry 140 associated with the True Name.

[0461] (ii) Determine an alternate source for the True File. If the source IDs field of the TFR entry includes the preferred alternate source, that is the alternate source. If it does not, but includes some other source, that is the alternate source. If it contains no alternate sources, there is no alternate source.

[0462] (iii) For each dependent processor in the True File registry entry 140, ask that processor to retire the True File, specifying an alternate source if one was determined, using the remote mechanism.
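The alternate-source rule in step (ii) and the fan-out in step (iii) can be sketched as follows. The function names and the tuple used to represent a retire request are hypothetical; the source IDs field is modeled as a simple list of processor names.

```python
from typing import List, Optional, Tuple


def choose_alternate_source(source_ids: List[str],
                            preferred: Optional[str] = None) -> Optional[str]:
    """Pick an alternate source following the order given in step (ii)."""
    if preferred is not None and preferred in source_ids:
        return preferred          # the preferred source is listed: use it
    if source_ids:
        return source_ids[0]      # otherwise any other listed source will do
    return None                   # no alternate source exists


def retire_requests(true_name: str, source_ids: List[str],
                    dependents: List[str],
                    preferred: Optional[str] = None
                    ) -> List[Tuple[str, str, Optional[str]]]:
    """Build one (processor, True Name, alternate source) request per dependent."""
    alternate = choose_alternate_source(source_ids, preferred)
    return [(processor, true_name, alternate) for processor in dependents]
```

Taking the first listed source when the preferred one is absent is an arbitrary choice made for the sketch; the text only requires that some other source be used.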

[0463] 6. Realize Directory at Location

[0464] This mechanism allows the user or operating system to force copies of files from some source location to the True File registry 126 at a given location. The purpose of the mechanism is to ensure that files are accessible in the event the source location becomes inaccessible. This can happen for instance if the source or given location are on mobile computers, or are on removable media, or if the network connection to the source is expected to become unavailable, or if the source is being retired.

[0465] This mechanism is provided in the following steps for each file in the given directory, with the exception of subdirectories:

[0466] (A) Get the local directory extensions table entry record 138 given the pathname of the file. Get the True Name of the local directory extensions table entry record 138. This service assimilates the file if it has not already been assimilated.

[0467] (B) Realize the corresponding True File at the given location. This service causes it to be copied to the given location from a remote system or removable media.

[0468] 7. Verify True File

[0469] This mechanism is used to verify that the data item in a True File registry 126 is indeed the correct data item given its True Name. Its purpose is to guard against device errors, malicious changes, or other problems.

[0470] If an error is found, the system has the ability to “heal” itself by finding another source for the True File with the given name. It may also be desirable to verify that the error has not propagated to other systems, and to log the problem or indicate it to the computer operator. These details are not described here.

[0471] To verify a data item that is not in a True File registry 126, use the Calculate True Name primitive mechanism described above.

[0472] The basic mechanism begins with a True Name, and operates in the following steps:

[0473] (A) Find the True File registry entry record 140 corresponding to the given True Name.

[0474] (B) If there is a True File ID for the True File registry entry record 140 then use it. Otherwise, indicate that no file exists to verify.

[0475] (C) Calculate the True Name of the data item given the file ID of the data item.

[0476] (D) Confirm that the calculated True Name is equal to the given True Name.

[0477] (E) If the True Names are not equal, there is an error in the True File registry 126. Remove the True File ID from the True File registry entry record 140 and place it somewhere else. Indicate that the True File registry entry record 140 contained an error.
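Steps (A) through (E) can be sketched as below. As assumptions of the sketch, SHA-256 from Python's `hashlib` stands in for the Calculate True Name primitive mechanism, the registry is a mapping from True Name to file ID, and "somewhere else" is modeled as a separate quarantine mapping. All function and parameter names are hypothetical.

```python
import hashlib
from typing import Dict, Optional


def calculate_true_name(data: bytes) -> str:
    """Stand-in for Calculate True Name: a digest over all of the data and only the data."""
    return hashlib.sha256(data).hexdigest()


def verify_true_file(registry: Dict[str, Optional[str]],
                     store: Dict[str, bytes],
                     quarantine: Dict[str, bytes],
                     true_name: str) -> str:
    """Return "ok", "error", or "no file to verify" for the given True Name."""
    file_id = registry.get(true_name)          # steps (A) and (B)
    if file_id is None:
        return "no file to verify"
    actual = calculate_true_name(store[file_id])   # step (C)
    if actual == true_name:                        # step (D)
        return "ok"
    # Step (E): detach the bad file from the registry entry and set it aside.
    quarantine[file_id] = store.pop(file_id)
    registry[true_name] = None
    return "error"
```

After an error, the entry no longer has a file ID, so a later call reports that there is no file to verify until another source supplies the True File.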

[0478] 8. Track for Accounting Purposes

[0479] This mechanism provides a way to know reliably which files have been stored on a system or transmitted from one system to another. The mechanism can be used as a basis for a value-based accounting system in which charges are based on the identity of the data stored or transmitted, rather than simply on the number of bits.

[0480] This mechanism allows the system to track possession of specific data items according to content by owner, independent of the name, date, or other properties of the data item, and tracks the uses of specific data items and files by content for accounting purposes. True Names make it possible to identify each file briefly yet uniquely for this purpose.

[0481] Tracking the identities of files requires maintaining an accounting log 134 and processing it for accounting or billing purposes. The mechanism operates in the following steps:

[0482] (A) Note every time a file is created or deleted, for instance by monitoring audit entries in the Process Audit File Entry primitive mechanism. When such an event is encountered, create an entry 148 in the accounting log 134 that shows the responsible party and the identity of the file created or deleted.

[0483] (B) Every time a file is transmitted, for instance when a file is copied with a Request True File remote mechanism or an Acquire True File remote mechanism, create an entry in the accounting log 134 that shows the responsible party, the identity of the file, and the source and destination processors.

[0484] (C) Occasionally run an accounting program to process the accounting log 134, distributing the events to the account records of each responsible party. The account records can eventually be summarized for billing purposes.
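A small sketch of step (C) follows, assuming a hypothetical log entry layout and a hypothetical price table keyed by event type and True Name. Prices are integers (for example, cents) to keep the arithmetic exact; events with no listed price are charged nothing.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class LogEntry:
    """One accounting log entry (hypothetical layout)."""
    party: str                        # responsible party
    event: str                        # "create", "delete", or "transmit"
    true_name: str                    # identity of the file
    source: Optional[str] = None      # only set for "transmit"
    destination: Optional[str] = None


def summarize(log: List[LogEntry],
              prices: Dict[Tuple[str, str], int]) -> Dict[str, int]:
    """Distribute log events to per-party totals using value-based prices."""
    totals: Dict[str, int] = defaultdict(int)
    for entry in log:
        # The charge depends on which file was involved, not on its size.
        totals[entry.party] += prices.get((entry.event, entry.true_name), 0)
    return dict(totals)
```

Because the price lookup is keyed by True Name, two files of equal size can carry different charges, which is the value-based accounting the mechanism is meant to support.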

[0485] 9. Track for Licensing Purposes

[0486] This mechanism ensures that licensed files are not used by unauthorized parties. The True Name provides a safe way to identify licensed material. This service allows proof of possession of specific files according to their contents without disclosing their contents.

[0487] Enforcing use of valid licenses can be active (for example, by refusing to provide access to a file without authorization) or passive (for example, by creating a report of users who do not have proper authorization).

[0488] One possible way to perform license validation is to perform occasional audits of employee systems. The service described herein relies on True Names to support such an audit, as in the following steps:

[0489] (A) For each licensed product, record in the license table 136 the True Name of key files in the product (that is, files which are required in order to use the product, and which do not occur in other products). Typically, for a software product, this would include the main executable image and perhaps other major files such as clip-art, scripts, or online help. Also record the identity of each system which is authorized to have a copy of the file.

[0490] (B) Occasionally, compare the contents of each user processor against the license table 136. For each True Name in the license table do the following:

[0491] (i) Unless the user processor is authorized to have a copy of the file, confirm that the user processor does not have a copy of the file using the Locate True File mechanism.

[0492] (ii) If the user processor is found to have a file that it is not authorized to have, record the user processor and True Name in a license violation table.
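The audit in step (B) can be sketched as a comparison of two tables. In this sketch the license table maps each True Name to the set of authorized processors, and a set-membership test stands in for the Locate True File mechanism; both representations, and the function name, are assumptions.

```python
from typing import Dict, List, Set, Tuple


def audit_licenses(license_table: Dict[str, Set[str]],
                   holdings: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Return (processor, True Name) pairs for unauthorized copies.

    license_table: True Name of a key file -> processors authorized to hold it.
    holdings:      processor -> True Names present on that processor.
    """
    violations: List[Tuple[str, str]] = []
    for processor in sorted(holdings):
        for true_name in sorted(license_table):
            if processor in license_table[true_name]:
                continue  # step (i): authorized processors are not checked
            if true_name in holdings[processor]:
                violations.append((processor, true_name))  # step (ii)
    return violations
```

The audit inspects only identifiers, so possession can be established without reading or disclosing the contents of any file.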

[0493] The System in Operation

[0494] Given the mechanisms described above, the operation of a typical DP system employing these mechanisms is now described in order to demonstrate how the present invention meets its requirements and capabilities.

[0495] In operation, data items (for example, files, database records, messages, data segments, data blocks, directories, instances of object classes, and the like) in a DP system employing the present invention are identified by substantially unique identifiers (True Names), the identifiers depending on all of the data in the data items and only on the data in the data items. The primitive mechanisms Calculate True Name and Assimilate Data Item support this property. For any given data item, using the Calculate True Name primitive mechanism, a substantially unique identifier or True Name for that data item can be determined.

[0496] Further, in operation of a DP system incorporating the present invention, multiple copies of data items are avoided (unless they are required for some reason such as backups or mirror copies in a fault-tolerant system). Multiple copies of data items are avoided even when multiple names refer to the same data item. The primitive mechanisms Assimilate Data Item and New True File support this property. Using the Assimilate Data Item primitive mechanism, if a data item already exists in the system, as indicated by an entry in the True File registry 126, this existence will be discovered by this mechanism, and the duplicate data item (the new data item) will be eliminated (or not added). Thus, for example, if a data file is being copied onto a system from a floppy disk, if, based on the True Name of the data file, it is determined that the data file already exists in the system (by the same or some other name), then the duplicate copy will not be installed. If the data item was being installed on the system by some name other than its current name, then, using the Link Path to True Name primitive mechanism, the other (or new) name can be linked to the already existing data item.

[0497] In general, the mechanisms of the present invention operate in such a way as to avoid recreating an actual data item at a location when a copy of that data item is already present at that location. In the case of a copy from a floppy disk, the data item (file) may have to be copied (into a scratch file) before it can be determined that it is a duplicate. This is because only one processor is involved. On the other hand, in a multiprocessor environment or DP system, each processor has a record of the True Names of the data items on that processor. When a data item is to be copied to another location (another processor) in the DP system, all that is necessary is to examine the True Name of the data item prior to the copying. If a data item with the same True Name already exists at the destination location (processor), then there is no need to copy the data item. Note that if a data item which already exists locally at a destination location is still copied to the destination location (for example, because the remote system did not have a True Name for the data item or because it arrives as a stream of un-named data), the Assimilate Data Item primitive mechanism will prevent multiple copies of the data item from being created.

[0498] Since the True Name of a large data item (a compound data item) is derived from and based on the True Names of components of the data item, copying of an entire data item can be avoided. Since some (or all) of the components of a large data item may already be present at a destination location, only those components which are not present there need be copied. This property derives from the manner in which True Names are determined.
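The component-level saving described above can be illustrated with a short sketch. It assumes fixed-size segments, SHA-256 as a stand-in for the True Name function, and a compound True Name computed over the concatenated segment True Names; these details and all function names are illustrative choices, not the definitive layout.

```python
import hashlib
from typing import List, Set


def true_name(data: bytes) -> str:
    """Stand-in True Name: a digest of the data (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def split(data: bytes, segment_size: int) -> List[bytes]:
    """Cut a large data item into fixed-size component segments."""
    return [data[i:i + segment_size] for i in range(0, len(data), segment_size)]


def compound_true_name(segments: List[bytes]) -> str:
    """Name the compound item by the True Names of its components."""
    indirect_block = "".join(true_name(s) for s in segments).encode("ascii")
    return true_name(indirect_block)


def segments_to_copy(segments: List[bytes],
                     destination_names: Set[str]) -> List[bytes]:
    """Keep only the components whose True Names the destination lacks."""
    return [s for s in segments if true_name(s) not in destination_names]
```

If the destination already holds most components, only the remainder crosses the network, while the compound True Name still identifies the whole item.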

[0499] When a file is copied by the Copy File or Directory operating system mechanism, only the True Name of the file is actually replicated.

[0500] When a file is opened (using the Open File operating system mechanism), it uses the Make True File Local primitive mechanism (either directly or indirectly through the Create Scratch File primitive mechanism) to create a local copy of the file. The Open File operating system mechanism uses the Make True File Local primitive mechanism, which uses the Realize True File from Location primitive mechanism, which, in turn, uses the Request True File remote mechanism.

[0501] The Request True File remote mechanism copies only a single data item from one processor to another. If the data item is a compound file, its component segments are not copied; only the indirect block is copied. The segments are copied only when they are read (or otherwise needed).

[0502] The Read File operating system mechanism actually reads data. The Read File mechanism is aware of compound files and indirect blocks, and it uses the Realize True File from Location primitive mechanism to make sure that component segments are locally available, and then uses the operating system file mechanisms to read data from the local file.

[0503] Thus, when a compound file is copied from a remote system, only its True Name is copied. When it is opened, only its indirect block is copied. When the corresponding file is read, the required component segments are realized and therefore copied.

[0504] In operation, data items can be accessed by reference to their identities (True Names) independent of their present location. The actual data item or True File corresponding to a given data identifier or True Name may reside anywhere in the system (that is, locally, remotely, offline, etc.). If a required True File is present locally, then the data in the file can be accessed. If the data item is not present locally, there are a number of ways in which it can be obtained from wherever it is present. Using the source IDs field of the True File registry table, the location(s) of copies of the True File corresponding to a given True Name can be determined. The Realize True File from Location primitive mechanism tries to make a local copy of a True File, given its True Name and the name of a source location (processor or media) that may contain the True File. If, on the other hand, for some reason it is not known where there is a copy of the True File, or if the processors identified in the source IDs field do not respond with the required True File, the processor requiring the data item can make a general request for the data item using the Request True File remote mechanism from all processors in the system that it can contact.

[0505] As a result, the system provides transparent access to any data item by reference to its data identity, and independent of its present location.

[0506] In operation, data items in the system can be verified and have their integrity checked. This follows from the manner in which True Names are determined. This can be used for security purposes, for instance, to check for viruses and to verify that data retrieved from another location is the desired and requested data. For example, the system might store the True Names of all executable applications on the system and then periodically redetermine the True Names of each of these applications to ensure that they match the stored True Names. Any change in a True Name potentially signals corruption in the system and can be further investigated. The Verify Region background mechanism and the Verify True File extended mechanism provide direct support for this mode of operation. The Verify Region mechanism is used to ensure that the data items in the True File registry have not been damaged accidentally or maliciously. The Verify True File mechanism verifies that a data item in a True File registry is indeed the correct data item given its True Name.

[0507] Once a processor has determined where (that is, at which other processor or location) a copy of a data item is in the DP system, that processor might need that other processor or location to keep a copy of that data item. For example, a processor might want to delete local copies of data items to make space available locally while knowing that it can rely on retrieving the data from somewhere else when needed. To this end the system allows a processor to Reserve (and cancel the reservation of) True Files at remote locations (using the remote mechanism). In this way the remote locations are put on notice that another location is relying on the presence of the True File at their location.

[0508] A DP system employing the present invention can be made into a fault-tolerant system by providing a certain amount of redundancy of data items at multiple locations in the system. Using the Acquire True File and Reserve True File remote mechanisms, a particular processor can implement its own form of fault-tolerance by copying data items to other processors and then reserving them there. However, the system also provides the Mirror True File background mechanism to mirror the True File (make copies of it available elsewhere in the system). Any degree of redundancy (limited by the number of processors or locations in the system) can be implemented. As a result, this invention maintains a desired degree or level of redundancy in a network of processors, to protect against failure of any particular processor by ensuring that multiple copies of data items exist at different locations.

[0509] The data structures used to implement various features and mechanisms of this invention store a variety of useful information which can be used, in conjunction with the various mechanisms, to implement storage schemes and policies in a DP system employing the invention. For example, the size, age and location of a data item (or of groups of data items) are provided. This information can be used to decide how the data items should be treated. For example, a processor may implement a policy of deleting local copies of all data items over a certain age if other copies of those data items are present elsewhere in the system. The age (or variations on the age) can be determined using the time of last access or modification in the local directory extensions table, and the presence of other copies of the data item can be determined either from the Safe Flag or the source IDs, or by checking which other processors in the system have copies of the data item and then reserving at least one of those copies.

[0510] In operation, the system can keep track of data items regardless of how those items are named by users (or regardless of whether the data items even have names). The system can also track data items that have different names (in different or the same location) as well as different data items that have the same name. Since a data item is identified by the data in the item, without regard for the context of the data, the problems of inconsistent naming in a DP system are overcome.

[0511] In operation, the system can publish data items, allowing other, possibly anonymous, systems in a network to gain access to the data items and to rely on the availability of these data items. True Names are globally unique identifiers which can be published simply by copying them. For example, a user might create a textual representation of a file on system A with True Name N (for instance as a hexadecimal string), and post it on a computer bulletin board. Another user on system B could create a directory entry F for this True Name N by using the Link Path to True Name primitive mechanism. (Alternatively, an application could be developed which hides the True Name from the users, but provides the same public transfer service.)

[0512] When a program on system B attempts to open pathname F linked to True Name N, the Locate Remote File primitive mechanism would be used, and would use the Locate True File remote mechanism to search for True Name N on one or more remote processors, such as system A. If system B has access to system A, it would be able to realize the True File (using the Realize True File from Location primitive mechanism) and use it locally. Alternatively, system B could find True Name N by accessing any publicly available True Name server, if the server could eventually forward the request to system A.

[0513] Clients of a local server can indicate that they depend on a given True File (using the Reserve True File remote mechanism) so that the True File is not deleted from the server registry as long as some client requires access to it. (The Retire True File remote mechanism is used to indicate that a client no longer needs a given True File.)

[0514] A publishing server, on the other hand, may want to provide access to many clients, and possibly anonymous ones, without incurring the overhead of tracking dependencies for each client. Therefore, a public server can provide expiration dates for True Files in its registry. This allows client systems to safely maintain references to a True File on the public server. The Check For Expired Links background mechanism allows the client of a publishing server to occasionally confirm that its dependencies on the publishing server are safe.

[0515] In a variation of this aspect of the invention, a processor that is newly connected (or reconnected after some absence) to the system can obtain a current version of all (or of needed) data in the system by requesting it from a server processor. Any such processor can send a request to update or resynchronize all of its directories (starting at a root directory), simply by using the Synchronize Directories extended mechanism on the needed directories.

[0516] Using the accounting log or some other user provided mechanism, a user can prove the existence of certain data items at certain times. By publishing (in a public place) a list of all True Names in the system on a given day (or at some given time), a user can later refer back to that list to show that a particular data item was present in the system at the time that list was published. Such a mechanism is useful in tracking, for example, laboratory notebooks or the like to prove dates of conception of inventions. Such a mechanism also permits proof of possession of a data item at a particular date and time.

[0517] The accounting log file can also track the use of specific data items and files by content for accounting purposes. For instance, an information utility company can determine the data identities of data items that are stored and transmitted through its computer systems, and use these identities to provide bills to its customers based on the identities of the data items being transmitted (as defined by the substantially unique identifier). The assignment of prices for storing and transmitting specific True Files would be made by the information utility and/or its data suppliers; this information would be joined periodically with the information in the accounting log file to produce customer statements.

[0518] Backing up data items in a DP system employing the present invention can be done based on the True Names of the data items. By tracking backups using True Names, duplication in the backups is prevented. In operation, the system maintains a backup record of data identifiers of data items already backed up, and invokes the Copy File or Directory operating system mechanism to copy only those data items whose data identifiers are not recorded in the backup record. Once a data item has been backed up, it can be restored by retrieving it from its backup location, based on the identifier of the data item. Using the backup record produced by the backup to identify the data item, the data item can be obtained using, for example, the Make True File Local primitive mechanism.
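The backup procedure above can be sketched as follows, assuming SHA-256 as a stand-in True Name function, an in-memory mapping as the backup location, and a set as the backup record. A direct assignment stands in for the Copy File or Directory mechanism, and the function names are hypothetical.

```python
import hashlib
from typing import Dict, List, Set


def true_name(data: bytes) -> str:
    """Stand-in True Name: a digest of the data (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def back_up(items: Dict[str, bytes], backup_record: Set[str],
            backup_store: Dict[str, bytes]) -> List[str]:
    """Copy only items whose True Names are not yet in the backup record.

    Returns the pathnames whose data was actually copied.
    """
    copied: List[str] = []
    for path in sorted(items):
        name = true_name(items[path])
        if name in backup_record:
            continue                      # same content already backed up
        backup_store[name] = items[path]
        backup_record.add(name)
        copied.append(path)
    return copied


def restore(name: str, backup_store: Dict[str, bytes]) -> bytes:
    """Retrieve a data item from the backup by its identifier."""
    return backup_store[name]
```

Two pathnames with identical contents produce one backup copy, and a second run over unchanged data copies nothing.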

[0519] In operation, the system can be used to cache data items from a server, so that only the most recently accessed data items need be retained. To operate in this way, a cache client is configured to have a local registry (its cache) with a remote Local Directory Extensions table (from the cache server). Whenever a file is opened (or read), the Local Directory Extensions table is used to identify the True Name, and the Make True File Local primitive mechanism inspects the local registry. When the local registry already has a copy, the file is already cached. Otherwise, the Locate True File remote mechanism is used to get a copy of the file. This mechanism consults the cache server and uses the Request True File remote mechanism to make a local copy, effectively loading the cache.

[0520] The Groom Cache background mechanism flushes the cache, removing the least-recently-used files from the cache client's True File registry. While a file is being modified on a cache client, the Lock Cache and Update Cache remote mechanisms prevent other clients from trying to modify the same file.

[0521] In operation, when the system is being used to cache data items, the problems of maintaining cache consistency are avoided.

[0522] To access a cache and to fill it from its server, a key is required to identify the data item desired. Ordinarily, the key is a name or address (in this case, it would be the pathname of a file). If the data associated with such a key is changed, the client's cache becomes inconsistent; when the cache client refers to that name, it will retrieve the wrong data. In order to maintain cache consistency it is necessary to notify every client immediately whenever a change occurs on the server.

[0523] By using an embodiment of the present invention, the cache key uniquely identifies the data it represents. When the data associated with a name changes, the key itself changes. Thus, when a cache client wishes to access the modified data associated with a given file name, it will use a new key (the True Name of the new file) rather than the key to the old file contents in its cache. The client will always request the correct data, and the old data in its cache will eventually be aged and flushed by the Groom Cache background mechanism.
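The consistency argument can be made concrete with a small sketch. It assumes an in-process `Server` object holding both the directory (pathname to True Name) and the registry (True Name to data), and SHA-256 as a stand-in True Name function. The class and attribute names are hypothetical, and grooming of old entries is not modeled.

```python
import hashlib
from typing import Dict


def true_name(data: bytes) -> str:
    """Stand-in True Name: a digest of the data (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


class Server:
    """Holds the directory (pathname -> True Name) and the registry."""

    def __init__(self) -> None:
        self.directory: Dict[str, str] = {}
        self.registry: Dict[str, bytes] = {}

    def write(self, path: str, data: bytes) -> None:
        name = true_name(data)
        self.registry[name] = data
        self.directory[path] = name   # changing the data changes the key


class CacheClient:
    """Caches data by True Name, so a stale entry is never returned for a path."""

    def __init__(self, server: Server) -> None:
        self.server = server
        self.cache: Dict[str, bytes] = {}
        self.fetches = 0

    def read(self, path: str) -> bytes:
        name = self.server.directory[path]     # look up the current key
        if name not in self.cache:
            self.cache[name] = self.server.registry[name]
            self.fetches += 1                  # cache miss: load from the server
        return self.cache[name]
```

After the server rewrites a file, the client's next read resolves to a new key and misses the cache, so no invalidation message is needed; the old entry simply remains until it is groomed.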

[0524] Because it is not necessary to immediately notify clients when changes on the cache server occur, the present invention makes it possible for a single server to support a much larger number of clients than is otherwise possible.

[0525] In operation, the system automatically archives data items as they are created or modified. After a file is created or modified, the Close File operating system mechanism creates an audit file record, which is eventually processed by the Process Audit File Entry primitive mechanism. This mechanism uses the New True File primitive mechanism for any file which is newly created, which in turn uses the Mirror True File background mechanism if the True File is in a mirrored or archived region. This mechanism causes one or more copies of the new file to be made on remote processors.

[0526] In operation, the system can efficiently record and preserve any collection of data items. The Freeze Directory primitive mechanism creates a True File which identifies all of the files in the directory and its subordinates. Because this True File includes the True Names of its constituents, it represents the exact contents of the directory tree at the time it was frozen. The frozen directory can be copied with its components preserved.

[0527] The Acquire True File remote mechanism (used in mirroring and archiving) preserves the directory tree structure by ensuring that all of the component segments and True Files in a compound data item are actually copied to a remote system. Of course, no transfer is necessary for data items already in the registry of the remote system.

[0528] In operation, the system can efficiently make a copy of any collection of data items, to support a version control mechanism for groups of the data items.

[0529] The Freeze Directory primitive mechanism is used to create a collection of data items. The constituent files and segments referred to by the frozen directory are maintained in the registry, without any need to make copies of the constituents each time the directory is frozen.

[0530] Whenever a pathname is traversed, the Get Files in Directory operating system mechanism is used, and when it encounters a frozen directory, it uses the Expand Frozen Directory primitive mechanism.

[0531] A frozen directory can be copied from one pathname to another efficiently, merely by copying its True Name. The Copy File operating system mechanism is used to copy a frozen directory.

[0532] Thus it is possible to efficiently create copies of different versions of a directory, thereby creating a record of its history (hence a version control system).

[0533] In operation, the system can maintain a local inventory of all the data items located on a given removable medium, such as a diskette or CD-ROM. The inventory is independent of other properties of the data items such as their name, location, and date of creation.

[0534] The Inventory Existing Directory extended mechanism provides a way to create True File Registry entries for all of the files in a directory. One use of this inventory is as a way to pre-load a True File registry with backup record information. Those files in the registry (such as previously installed software) which are on the volumes inventoried need not be backed up onto other volumes.

[0535] The Inventory Removable, Read-only Files extended mechanism not only determines the True Names for the files on the medium, but also records directory entries for each file in a frozen directory structure. By copying and modifying this directory, it is possible to create an online patch, or small modification of an existing read-only file. For example, it is possible to create an online representation of a modified CD-ROM, such that the unmodified files are actually on the CD-ROM, and only the modified files are online.

[0536] In operation, the system tracks possession of specific data items according to content by owner, independent of the name, date, or other properties of the data item, and tracks the uses of specific data items and files by content for accounting purposes. The Track for Accounting Purposes extended mechanism provides a way to know reliably which files have been stored on a system or transmitted from one system to another.

[0537] True Names in Relational and Object-Oriented Databases

[0538] Although the preferred embodiment of this invention has beenpresented in the context of a file system, the invention of True Nameswould be equally valuable in a relational or object-oriented database. Arelational or object-oriented database system using True Names wouldhave similar benefits to those of the file system employing theinvention. For instance, such a database would permit efficientelimination of duplicate records, support a cache for records, simplifythe process of maintaining cache consistency, providelocation-independent access to records, maintain archives and historiesof records, and synchronize with distant or disconnected systems ordatabases.

[0539] The mechanisms described above can be easily modified to serve in such a database environment. The True Name registry would be used as a repository of database records. All references to records would be via the True Name of the record. (The Local Directory Extensions table is an example of a primary index that uses the True Name as the unique identifier of the desired records.)

[0540] In such a database, the operations of inserting, updating, and deleting records would be implemented by first assimilating records into the registry, and then updating a primary key index to map the key of the record to its contents by using the True Name as a pointer to the contents.
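As a rough illustration of the insert, update, and delete operations just described, the following sketch keeps a registry of record contents keyed by identifier and a primary key index that points into it. The class name `TrueNameTable`, the JSON serialization, and the use of SHA-256 are all assumptions made for the example; removal of unreferenced registry entries (grooming) is omitted.

```python
import hashlib
import json


class TrueNameTable:
    """Toy table: a primary key index over a content-addressed registry."""

    def __init__(self):
        self.registry = {}       # identifier -> serialized record contents
        self.primary_index = {}  # primary key -> identifier

    def _assimilate(self, record):
        """Store the record once and return its content-derived identifier."""
        encoded = json.dumps(record, sort_keys=True).encode("utf-8")
        name = hashlib.sha256(encoded).hexdigest()
        self.registry.setdefault(name, encoded)  # duplicates share one entry
        return name

    def upsert(self, key, record):
        """Insert or update: assimilate first, then repoint the key."""
        self.primary_index[key] = self._assimilate(record)

    def delete(self, key):
        """Remove the key; the registry entry is left for later grooming."""
        self.primary_index.pop(key, None)

    def get(self, key):
        """Follow the key to its identifier, then to the record contents."""
        return json.loads(self.registry[self.primary_index[key]])
```

Two keys holding identical records point at a single registry entry in this sketch, which is the duplicate elimination the preceding paragraphs attribute to a True Name database.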

[0541] The mechanisms described in the preferred embodiment, or similar mechanisms, would be employed in such a system. These mechanisms could include, for example, the mechanisms for calculating True Names, assimilating, locating, realizing, deleting, copying, and moving True Files, for mirroring True Files, for maintaining a cache of True Files, for grooming True Files, and other mechanisms based on the use of substantially unique identifiers.

[0542] While the invention has been described in connection with what is presently considered to be the most practical and preferred embodiments, it is to be understood that the invention is not to be limited to the disclosed embodiment, but on the contrary, is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

What is claimed is:
 1. In a data processing system, an apparatus comprising: identity means for determining, for any of a plurality of data items in the system, a substantially unique identifier, said identifier depending on all of the data in the data item and only on the data in the data item; and existence means for determining whether a particular data item is present in the system, by examining the identifiers of the plurality of data items.
 2. An apparatus as in claim 1, further comprising: local existence means for determining whether an instance of a particular data item is present at a particular location in the system, based on the identifier of the data item.
 3. An apparatus as in claim 2, wherein each location contains a distinct plurality of data items, and wherein said local existence means determines whether a particular data item is present at a particular location in the system by examining the identifiers of the plurality of data items at said particular location in the system.
 4. An apparatus as in claim 2, further comprising: data associating means for making and maintaining, for a data item in the system, an association between the data item and the identifier of the data item; and access means for accessing a particular data item using the identifier of the data item.
 5. An apparatus as in claim 2, further comprising: duplication means for copying a data item from a source to a destination in the data processing system, by providing said destination with the data item only if it is determined using the data identifier that the data item is not present at the destination.
 6. An apparatus as in claim 4, further comprising: assimilation means for assimilating a new data item into the system, said assimilation means invoking said identity means to determine the identifier of the new data item and invoking said data associating means to associate the new data item with its identifier.
 7. An apparatus as in claim 4, further comprising: duplication means for duplicating a data item from a source location to a destination location in the data processing system, based on the identifier of the data item, said duplication means invoking said local existence means to determine whether an instance of the data item is present at the destination location, and invoking said access means to provide said destination with the data item only if said local existence means determines that no instance of the data item is present at the destination.
 8. An apparatus as in claim 7, further comprising: backup means for making copies of data items in the system, said backup means maintaining a backup record of identifiers of data items backed up, and invoking duplication means to copy only those data items whose data identifiers are not recorded in the backup record.
 9. An apparatus as in claim 8, further comprising: recovery means for retrieving a data item previously backed up by said backup means, based on the identifier of the data item, said recovery means using the backup record to identify the data item, and invoking access means to retrieve the data item.
 10. An apparatus as in claim 2, wherein a location is a computer among a network of computers, the apparatus further comprising: remote existence means for determining whether a data item is present at a remote location in the system from a current location in the system, based on the identifier of the data item, said remote location using local existence means at the remote location to determine whether the data item is present at the remote location, and providing the current location with an indication of the presence of the data item at the remote location.
 11. An apparatus as in claim 4, wherein a location is a computer among a network of computers, the apparatus further comprising: requesting means for requesting a data item at a current location in the system from a remote location in the system, based on the identifier of the data item, said remote location using access means at the remote location to obtain the data item and to send it to the current location if it is present.
 12. An apparatus as in claim 1, further comprising: context means for making and maintaining a context association between at least one contextual name of a data item in the system and the identifier of the data item; and referencing means for obtaining the identifier of a data item in the system given a contextual name for the data item, using said context association.
 13. An apparatus as in claim 12, further comprising: assignment means for assigning a data item to a contextual name, invoking said identity means to determine the identifier of the data item, and invoking said context means to make or modify the context association between the contextual name of the data item and the identifier of the data item.
 14. An apparatus as in claim 12, further comprising: data associating means for making and maintaining, for a data item in the system, an association between the data item and the identifier of the data item; access means for accessing a particular data item using the identifier of the particular data item; and contextual name access means for accessing a data item in the system for a given context name of the data item, determining the data identifier associated with the given context name, and invoking said access means to access the data item using the data identifier.
 15. An apparatus as in claim 11, further comprising: transparent access means for accessing a data item from one of several locations, using the identifier of the data item, said transparent access means invoking said local existence means to determine if the particular data item is present at the current location, and, in the case when the particular data item is not present at the current location, invoking said requesting means to obtain the data item from a remote location.
 16. An apparatus as in claim 15, further comprising: identifier copy means for copying an identifier of a data item from a source location to a destination location.
 17. An apparatus as in claim 15, further comprising: context means for making and maintaining a context association between a contextual name of a data item in the system and the identifier of the data item; context copy means for copying a data item from a source location to a destination location, given the contextual name of the data item, by copying only the context association between the contextual identifier and the data identifier from the source location to the destination location; and transparent referencing means for obtaining a data item from one of several locations in the system given a contextual name for the data item, said transparent referencing means invoking said context association to determine the data identifier of a data item given a contextual name, and invoking said transparent access means to access the data item from one of several locations given the identifier of the data item.
 18. An apparatus as in claim 1, wherein at least some of said data items are compound data items, each compound data item including at least some component data items in a fixed sequence, and wherein the identity means determines the identifier of a compound data item based on each component data item of the compound data item.
 19. An apparatus as in claim 18, wherein said compound data items are files and said component data items are segments, and wherein the identity means determines the identifier of a file based on the identifier of each data segment of the file.
 20. An apparatus as in claim 18, wherein said compound data items are directories and said component data items are files or subordinate directories, and wherein the identity means determines the identifier of a given directory based on each file and subordinate directory within the given directory.
 21. An apparatus as in claim 11, further comprising: means for advertising a data item from a location in the system to at least one other location in the system, said means for advertising providing each of said at least one other location with the data identifier of the data item, and providing the data item to only those locations of said other locations that request said data item in response to said providing.
 22. An apparatus as in claim 18, further comprising: local existence means for determining whether a particular data item is present at a particular location in the system, based on the identifier of the data item; and compound copy means for copying a data item from a source to a destination in the data processing system, said compound copy means invoking said local existence means to determine whether the data item is present at the destination, and to determine, when the data item is a compound data item, whether the component data items of the compound data item are present at the destination, and providing said destination with the data item only if said local existence means determines that the data item is not present at the destination, and providing said destination with each component data item only if said local existence means determines that the component data item is not present at the destination.
 23. An apparatus as in claim 11, further comprising: means for verifying the integrity of a data item obtained from said requesting means in response to providing said requesting means with a particular data identifier, to confirm that the data item obtained from the requesting means is the same data item as the data item requested, said verifying means invoking said identity means to determine the data identifier of the obtained data item, and comparing said determined data identifier with said particular data identifier to verify said obtained data item.
 24. An apparatus as in claim 2, wherein a location is at least one of a storage location and a processing location, and wherein a storage location is at least one of a data storage device and a data storage volume, and wherein a processing location is at least one of a data processor and a computer.
 25. An apparatus as in claim 3, wherein at least some of said data items are compound data items, each compound data item including at least some component data items in a fixed sequence, and wherein the identity means determines the identifier of a compound data item based on the identifier of each component data item of the compound data item.
 26. An apparatus as in claim 3, further comprising: context associating means for making and maintaining a context association, for any data item in the system, between the identifier of the data item and at least one contextual name of the data item at a particular location in the system; means for obtaining the identifier of a data item in the system given a contextual name for the data item at a particular location in the system; and logical copy means for associating the data identifier corresponding to a contextual name at a source location with a contextual name at a destination location in the data processing system.
 27. An apparatus as in claim 25, wherein said compound data items are files and said component data items are segments, and wherein the identity means determines the identifier of a file based on the identifier of each data segment of the file.
 28. An apparatus as in claim 25, further comprising: compound copy means for copying a data item from a source location to a destination location in the data processing system, said compound copy means invoking said local existence means to determine whether the data item is present at the destination, and to determine, when the data item is a compound data item, whether the component data items of the compound data item are present at the destination, and providing said destination with the data item only if said local existence means determines that the data item is not present at the destination, and providing said destination with each component data item only if said local existence means determines that the component data item is not present at the destination.
 29. An apparatus as in any of claims 1-28, wherein a data item is at least one of a file, a database record, a message, a data segment, a data block, a directory, and an instance of an object class.
 30. A method of identifying a data item in a data processing system for subsequent access to the data item, the method comprising the steps of: determining a substantially unique identifier for the data item, said identifier depending on all of the data in the data item and only on the data in the data item; and accessing a data item in the system using the identifier of the data item.
 31. A method as in claim 30, further comprising the step of: making and maintaining, for a plurality of data items in the system, an association between each of the data items and the identifier of each of the data items, wherein said accessing step accesses a data item via the association.
 32. A method as in claim 31, further comprising the step of: assimilating a new data item into the system, by determining the identifier of the new data item and associating the new data item with its identifier.
 33. A method for duplicating a given data item from a source location to a destination location in a data processing system, the method comprising the steps of: determining a substantially unique identifier for the given data item, said identifier depending on all of the data in the data item and only on the data in the data item; determining, using said data identifier, whether said data item is present at said destination location; and based on said determining, providing said destination location with said data item only if said data item is not present at said destination.
 34. A method as in claim 33, wherein said given data item is a compound data item having a plurality of component data items, the method further comprising the steps of: for each data item of said component data items, obtaining the component data identifier of the data item by determining a substantially unique identifier for the data item, said identifier depending on all of the data in the data item and only on the data in the data item; determining, using said obtained component data identifier, whether said data item is present at said destination; and based on said determining, providing said destination with said data item only if said data item is not present at said destination.
 35. A method for determining whether a particular data item is present in a data processing system, the method comprising the steps of: (A) for each data item of a plurality of data items in the system, (i) determining a substantially unique identifier for the data item, said identifier depending on all of the data in the data item and only on the data in the data item; and (ii) making and maintaining a set of identifiers of said plurality of data items; and (B) for the particular data item, (i) determining a particular substantially unique identifier for the data item, said identifier depending on all of the data in the data item and only on the data in the data item; and (ii) determining whether said particular identifier is in said set of data items.
 36. A method of backing up, of a plurality of data items, data items modified since a previous backup time in a data processing system, the method comprising the steps of: (A) maintaining a backup record of identifiers of data items backed up at the previous backup time; and (B) for each of said plurality of data items, (i) determining a substantially unique identifier for the data item, said identifier depending on all of the data in the data item and only on the data in the data item; (ii) determining those data items of the plurality of data items whose identifiers are not in the backup record; and (iii) based on said determining, copying only those data items whose data identities are not recorded in the backup record.
 37. A method as in claim 36, further comprising the step of: recording in the backup record the identifiers of those data items copied in said step of copying.
 38. A method of locating a particular data item at a location in a data processing system, the method comprising the steps of: (A) determining a substantially unique identifier for the data item, said identifier depending on all of the data in the data item and only on the data in the data item; (B) requesting the particular data item by sending the data identifier of the data item from the requestor location to at least one location of a plurality of provider locations in the system; and (C) on at least some of said provider locations, (a) for each data item of a plurality of data items at said provider locations, (i) determining a substantially unique identifier for the data item, said identifier depending on all of the data in the data item and only on the data in the data item; and (ii) making and maintaining a set of identifiers of data items, (b) determining, based on said set of identifiers, whether the data item corresponding to the requested data identifier is present at said provider location; and (c) based on said determining, when said provider location determines that the particular data item is present at the provider location, notifying said requestor that the provider has a copy of the given data item.
 39. The method of claim 38, further comprising the steps of: (a) for each data item of a plurality of data items at said provider locations, making and maintaining an association between the data item and the identifier of the data item, (b) in response to said notifying, said client location copying said data item from one of said responding remote locations, using said association to access the data item given the data identifier.
 40. A method of locating a particular data item among a plurality of locations, each of said locations having a plurality of data items, the method comprising the steps of: determining, for the particular data item and for each data item of the plurality of data items, a substantially unique identifier for the data item, said identifier depending on all of the data in the data item and only on the data in the data item; and determining the presence of the particular data item in each of said plurality of locations by determining whether the identifier of the particular data item is present at each of said locations.
 41. The method of claim 30, wherein said step of accessing further comprises the steps of, for a given data identifier and for a given current location and a remote location in the system: determining whether the data item corresponding to the given data identifier is present at the current location, and based on said determining, if said data item is not present at the current location, fetching the data item from a remote location in the system to the current location.
 42. The method of claim 41, further comprising the steps of: for each contextual name at a location, making and maintaining a context association between the context name of a data item and the identifier of said data item, and when some context association changes at said current location, notifying said remote location of a modification to the context association.
 43. The method of claim 42, further comprising the step of: at said remote location, updating the association between the contextual identifier of the data item and the identifier of the data item.
 44. The method of claim 43, further comprising the step of: from said remote location, notifying all other locations that said data item has been modified, by providing the contextual identifier and data identifier of said data item to said other locations.
 45. The method of claim 44, further comprising the step of, at each location notified that the data item has been modified: modifying an association between the contextual identifier of the data item and the data identifier of the data item, to record that the data item has been modified.
 46. A method of eliminating a data item at a given location in a data processing system when said data item can be obtained from another location in the system, the method comprising the steps of: determining a substantially unique identifier for the data, said identifier depending on all of the data in the data item and only on the data in the data item; making and maintaining a source association between the data identifier and at least one location at which said data item is known to be present; and based on said source association, if said data item is present at said other location, removing the data item from the given location.
 47. A method of deleting a data item from a location in a data processing system, the method comprising the steps of: for each of a plurality of data items in the system: determining a substantially unique identifier for the data, said identifier depending on all of the data in the data item and only on the data in the data item; and making and maintaining an association between each of the data items and the unique identifier of the data items; and for a given data item: determining a substantially unique identifier for the data, said identifier depending on all of the data in the data item and only on the data in the data item; and determining whether a contextual identifier or a compound data item or a remote processor in the system refers to the unique identifier of the data item, and based on said determining, deleting said data item and its association if no other contextual identifier or compound data item or remote processor refers to said data item.
 48. The method of claim 47, wherein said determining is based on a use count for the data item, and wherein said data item is deleted only if said use count indicates that no other contextual identifier or compound data item or remote processor in the system refers to the data item.
 49. A method of substantially synchronizing data items at a client location in a data processing system after a period of independent changes on the client and another location in the system, given a context, the method comprising the steps of: making and maintaining a list of changes to the context association between each context name of a data item and the identifier of said data item, in the given context and during the period of independent change; obtaining the list of changes from the other location for the given context; and, for each context name in the list of changes, updating the context identifier associations at the client whenever it is determined that the context association of the given context name changed either only at the client or only at the other location during the period of independent changes; and performing a conflict-resolution task such as notifying an operator of the client location, whenever it is determined that the context association changed at both the client and the other location.
 50. A method as in claim 49, wherein said lists are maintained as queues based on a temporal order, and wherein, at said client location, said replacing is based on said temporal order.
 51. A method of maintaining at least a predetermined number of copies of a given data item in a data processing system, at different locations in the data processing system, said data processing system being one wherein data is identified by a substantially unique identifier, said identifier depending on all of the data in the data item and only on the data in the data item, and wherein any data item in the system may be accessed using only the identifier of the data item, the method comprising the steps of: (i) sending, from a first location in the system, the data identifier of the given data item to other locations in the system; and (ii) in response to said sending, at each of said other locations, (A) determining whether the data item corresponding to the data identifier is present at the other location, and based on said determining, and (B) informing said first location whether said data item is present at the other location; and (iii) in response to said informing from said other locations, at said first location, (A) determining whether said data item is present in at least the predetermined number of other locations, and based on said determining, (B) when less than the predetermined number of other locations have a copy of the data item, requesting some locations that do not have a copy of the data item make a copy of the data item.
 52. A method as in claim 51, wherein said step (iii) further comprises the step of: (C) when more than the predetermined number of other locations have a copy of the data item, requesting some locations that do have a copy of the data item delete the copy of the data item.
 53. A method as in any of claims 30-52, wherein said data items are at least one of a file, a database record, a message, a data segment, a data block, a directory, and an instance of an object class.